A NEW CONVERGENCE ANALYSIS AND PERTURBATION
RESILIENCE OF SOME ACCELERATED PROXIMAL 
FORWARD-BACKWARD ALGORITHMS WITH ERRORS 

DANIEL REEM AND ALVARO DE PIERRO 


Abstract. Many problems in science and engineering involve, as part of their solution
process, the consideration of a separable function which is the sum of two convex functions,
one of them possibly non-smooth. Recently a few works have discussed inexact
versions of several accelerated proximal methods aiming at solving this minimization
problem. This paper shows that inexact versions of a method of Beck and Teboulle
(FISTA) preserve, in a Hilbert space setting, the same (non-asymptotic) rate of convergence
under some assumptions on the decay rate of the error terms. The notion of
inexactness discussed here seems to be rather simple, but, interestingly, when compared
with related works, closely related decay rates of the error terms yield closely related
convergence rates. The derivation sheds some light on the somewhat mysterious origin
of some parameters which appear in various accelerated methods. A consequence
of the analysis is that the accelerated method is perturbation resilient, making it suitable,
in principle, for the superiorization methodology. By taking this into account, we
re-examine the superiorization methodology and significantly extend its scope.


1. Introduction 

1.1. Background: Many problems in science and engineering involve, as part of their
solution process, the consideration of the following minimization problem:

$$\inf\{F(x) : x \in H\}. \tag{1}$$

Here $F$ is a separable function of the form $F = f + g$, both $f$ and $g$ are convex functions
defined on a real Hilbert space $H$ (with an inner product $\langle\cdot,\cdot\rangle$ and an induced norm
$\|\cdot\|$), the function $g$ is lower semicontinuous and possibly non-smooth, and $f$ is continuously
differentiable and its derivative $f'$ is Lipschitz continuous with a Lipschitz constant
$L(f') > 0$. A typical scenario of (1) appears in linear inverse problems [35,42]. There
$H = \mathbb{R}^n$, $b \in \mathbb{R}^m$, $f(x) = \|Ax - b\|^2$ for some $m \times n$ matrix $A$, $g(x) = \lambda\|Lx\|^2$, and $L$ is an
$m \times n$ matrix (often $L$ is the identity operator, or a diagonal one, or a discrete approximation
of a differential operator). The dimensions $m$ and $n$ are large, e.g., on the order of
magnitude of 10^, and $\lambda$ is a fixed positive constant (the regularization parameter). The
goal is to estimate the solution $x \in \mathbb{R}^n$ to the linear equation

$$Ax = b + u, \tag{2}$$

where $u \in \mathbb{R}^m$ is an unknown noise vector. The solution $x$ frequently represents an image
or a signal, and the consideration of (1) instead of (2) is motivated by the fact that
(2) is often ill-conditioned.

The $\ell_1$-$\ell_2$ minimization problem (or closely related variations of it) is a variation of
the previous problem which has become popular in machine learning, compressed sensing,
and signal processing [4,17,24,87]. Here one frequently takes $g(x) = \lambda\|Lx\|_1$ or
$g(x) = \sum_{\nu \in \Pi} \lambda_\nu \|x_\nu\|_\infty$, where $\|x\|_1 = \sum_{i=1}^{n} |x_i|$, $\Pi$ is a vector of positive integers $p$ whose sum is
$n$, $\|x_\nu\|_\infty = \max\{|x_j| : j \in \{1, \ldots, p\}\}$, and $\lambda_\nu > 0$ for all components $\nu$ of $\Pi$. The non-smooth
terms are used for increasing sparsity. As a final example we mention the nuclear
norm approximation minimization problem, which has several versions (and it includes, as
a special case, the minimum rank matrix completion problem). In one version
$f(x) = \|Ax - b\|^2$, where $A$ is a linear mapping and $b$ is a given vector, $g(x) = \lambda\|x\|_{\mathrm{nuc}}$, and $\|x\|_{\mathrm{nuc}}$ is
the nuclear norm of $x$, i.e., the sum of singular values of $x$, where here $x$ is viewed as a
matrix [16,60] (the nuclear norm is aimed at providing a convex approximation of the
matrix rank function). In a second version $x \in \mathbb{R}^n$, $f : \mathbb{R}^n \to \mathbb{R}$ is a quadratic function,
$g(x) = \|Ax - B\|_{\mathrm{nuc}}$, $A$ is a linear mapping from $\mathbb{R}^n$ to a space of matrices, and $B$ is a given
matrix in that space [59]. As discussed in the previous references and in some of the references therein,
this problem has applications in control and system theory, compressed sensing, computer
vision, data recovery, and more.

Proximal (gradient) methods are among the methods used for solving (1). Roughly
speaking, they have the form

$$x_k = \operatorname{argmin}_{x \in H} Q_k(x, y_k), \tag{3}$$

where $Q_k$ is a sum of a two-variable quadratic function and of $g$, and it depends
on the iteration $k$, on $f$, and possibly on some other parameters, and $y_k$ depends linearly
on previous iterations. When $y_k = x_{k-1}$, convergence of the iterative sequence $(x_k)_{k=1}^{\infty}$ to
a solution $x^*$ of (1) (assuming such a solution exists) can be established, but unless some
strong conditions are imposed on $F$ and/or other components involved in the problem
(e.g., properties of the solution set), both the asymptotic convergence ($x_k \to x^*$ as $k \to \infty$) and
the non-asymptotic one ($F(x_k) \to F(x^*)$ as $k \to \infty$) can be slow, e.g., $F(x_k) - F(x^*) = O(1/k)$.
See, for instance, the discussions in [9] about the ISTA method (Iterative Shrinkage-Thresholding
Algorithm), and in [15, Chapters 4-5], [26] about related generalizations and
variations.

The above disadvantage is one of the reasons why accelerated proximal gradient methods
are of interest, methods in which a non-asymptotic rate of convergence of the form
$F(x_k) - F(x^*) = O(1/k^2)$ can be achieved. The first significant achievements in this
area seem to be the works of Nemirovski and Yudin [65, Chapter 7] (1979), and Nemirovski
[64] (1982) (with ideas which go back to their 1977 paper [94]), for the case of
certain smooth functions (i.e., $g \equiv 0$) in a certain class of smooth real reflexive Banach
spaces. However, their methods were rather complicated. A breakthrough occurred some
time later (1983) by Nesterov [66], who presented a simple and very practical accelerated
method for the case $g \equiv 0$ and $F$ defined on a Euclidean space. A few years ago
there were additional significant achievements when the case of $F = f + g$ with
a non-smooth $g$ was discussed in a Euclidean space setting by Nesterov [69, Section 4]
in 2007 and Beck and Teboulle [9] in 2009. Both papers improved independently
(and using different approaches) Nesterov's method [66] using clever modifications. Beck
and Teboulle called their method FISTA (Fast Iterative Shrinkage-Thresholding Algorithm).
In the accelerated methods $y_k$ is not $x_{k-1}$ but a linear combination of several
previous iterations. For instance, in FISTA $y_{k+1} = x_k + \beta_k(x_k - x_{k-1})$ and in Nesterov's
method $y_{k+1} = \beta_k z_k + (1 - \beta_k)x_k$, where $z_k$ is a minimizer of a one-variable
quadratic function and $\beta_k$ is a positive parameter. For related accelerated methods, see,
e.g., [3,8,10,39,40,55,63,67,68,89,90,101].

A natural question regarding these accelerated methods is whether they are perturbation
resilient. In other words, are they stable, i.e., do they still exhibit an accelerated rate
of convergence despite perturbations which may appear in the iterative steps due to noise,
computational errors, etc.? The relevance of this question becomes even more evident when
taking into account the fact that the iterative step in these methods involves a proximity
operator (see (3) and (6)) whose computation is likely to be inexact, since it is itself a
solution to a minimization problem. Because of that, there has been a rather wide related
discussion on inexact proximal forward-backward methods, as the following partial list of
references shows: [1,11,12,25-27,34,41,43,47-49,53,54,70,79-81,83-85,93,96]. In
these papers various notions of inexactness and various settings are discussed (however,
in many cases the methods are non-accelerated, the functions are non-separable, and no
convergence estimates are given).

Another motivation to discuss inexactness in relation to proximal methods is the recent
optimization scheme called “superiorization” [19,28,44]. In this scheme one uses
carefully selected perturbations in an active way in order to obtain solutions which have
some good properties, properties which are measured with respect to some auxiliary cost
function (or energy/merit function). For instance, if one wants to minimize a given function
under some constraints, then instead of solving this problem, which might be too
demanding, one may try to find a point which satisfies the constraints but is not necessarily
a minimizer. Instead, this point will have a low cost function value and hence it
will be superior to other points which satisfy the constraints. See Section 4 for a more
comprehensive discussion and many more related references.

To the best of our knowledge, the issue of inexactness related to accelerated proximal
forward-backward methods with a separable function $F = f + g$ has been considered only
in the following papers: Devolder et al. [33], Jiang et al. [50], Monteiro and Svaiter [62],
Schmidt et al. [82], and Villa et al. [92] (the latter is the only work where $H$ is allowed to
be infinite dimensional). In these works (3) is replaced by

$$x_k \approx \operatorname{argmin}_{x \in H} Q_k(x, y_k), \tag{4}$$

where the approximation depends on the perturbation terms and the notion of inexactness
(4) depends on the paper.

In [82] the inexactness (4) means that $\tilde{Q}_k(x_k) \le \epsilon_k + \tilde{Q}_k(y'_k)$, where $\epsilon_k \ge 0$ is given, $y'_k$
is a solution to an approximate quadratic minimization problem depending on previous
iterations, and $\tilde{Q}_k$ is a perturbed version of $Q_k$ obtained by perturbing the gradient of
the quadratic term of $Q_k$ by a given error vector. In [50, p. 1046] the authors consider
a different approximation notion. Now (4) means that $F(x_k) \le Q_k(x_k) + \xi_k/(2t_k^2)$ and
that a norm of $\delta_k$ is bounded from above by $\epsilon_k/(\sqrt{2}\,t_k)$, where
$\delta_k := f'(y_k) + A_k(x_k - y_k) + \gamma_k$, $A_k : H \to H$ is some
positive definite linear operator, $t_k > 0$ is a parameter defined recursively (see (9) below),
$\xi_k$ and $\epsilon_k$ are given positive parameters, and $\gamma_k$ belongs to an $\epsilon$-subdifferential of $g$.
Here, as usual, for a given $\epsilon \ge 0$ the $\epsilon$-subdifferential of $g$ is

$$\partial_\epsilon g(z) = \{u \in H : g(z) + \langle u, x - z\rangle \le g(x) + \epsilon,\ \forall x \in H\}.$$

When $\epsilon_k = 0$, then $\delta_k = 0$ and (3) is obtained.

The notion of inexactness of [92] (see [92, Definition 2.1], [92, Theorem 4.3]) is also
related to the $\epsilon$-subdifferential: given an estimate parameter $\epsilon_k \ge 0$, the approximation
(4) holds if and only if $(y_k - \lambda_k f'(y_k) - x_k)/\lambda_k \in \partial_{\epsilon_k^2/(2\lambda_k)} g(x_k)$, where $\lambda_k \in (0, 2/L(f')]$ is a
relaxation parameter. When $\epsilon_k = 0$ then (3) holds due to the optimality condition associated with
$Q_k$. In [62] there is a discussion and general results which allow inexactness, e.g., [62,
Sections 3-4]. However, the application of these results for the setting of a separable
function, namely, [62, Algorithm I], is actually without inexactness.

Finally, in [33] (see especially Definition 1 and the properties after it, Algorithm 3, and
Subsection 8.2) this notion is related to the concept of an inexact first order oracle of a convex
function, called a $(\delta, L)$-oracle in [33]. Here (4) means that $x_k = \operatorname{argmin}_{x \in C} \tilde{Q}_k(x, y_k)$,
where $C$ is a fixed closed and convex subset of $H$ (the minimization is done over $C$ instead
of over $H$) and $\tilde{Q}_k$ is a quadratic upper bound on $F$ which coincides with $F$ at $y_k$. It
is obtained from a modification of $Q_k$ by replacing in $Q_k$ the coefficient $0.5L(f')$ of the
quadratic term $0.5L(f')\|x - y_k\|^2$ by $0.5\tilde{L} := 0.5(L(f') + (1/(2\delta))M^2)$. Here $\delta := \delta_k$ is
an error term and $M > 0$ is an upper bound on the variation of the subgradients of $g$
over $C$. Because one assumes that $M$ is finite, $C$ usually cannot be unbounded. As noted
in [33, p. 48], the parameter $\delta$ does not represent an actual accuracy and it can be chosen
as small as one wants at the price of having a larger $\tilde{L}$, i.e., a worse quadratic upper bound
on $F$.

1.2. Contribution: We consider two inexact versions (constant step size rule and backtracking
step size rule) of FISTA and show that FISTA is perturbation resilient in the
function values, namely, it still converges non-asymptotically despite a certain type of
perturbations which appear in the algorithmic sequences. The notion of inexactness we
consider is of the form

$$x_k = e_k + \operatorname{argmin}_{x \in H} Q_k(x, y_k),$$

which seems to be rather simple compared with the notions considered in the previously mentioned
works. Such a notion of inexactness is closely related to notions considered by, for instance,
Combettes-Wajs [26, Theorem 3.4], Rockafellar [79, Theorem 1], and Zaslavski [95, Theorem
1.2] in a different context (non-accelerated proximal methods). Depending on the
rate of decay of the magnitude of the perturbations $e_k$ to zero, either the original $O(1/k^2)$
convergence rate is preserved or a slower one is obtained. Interestingly, despite the difference
in the notion of inexactness and in the algorithmic schemes, the rate of decay we
obtain is closely related to other schemes (Corollaries 3.7-3.8 and Remark 3.10 below).
We allow the ambient space $H$ to be infinite dimensional, as in [92] (and [80,89], which,
however, do not consider inexactness) but not elsewhere. Unless the perturbations vanish,
we require $g$ to be finite. This is a somewhat stronger condition than in several previous
works in which $g$ was allowed to attain the value $\infty$, but it is more general than the original
paper of Beck and Teboulle [9]. In contrast to previous works on inexact accelerated
methods, we allow the case $\inf_{x \in H} F(x) = -\infty$ and we do not require the optimal set to
be nonempty (for the exact case, only [80,89,90,92] allow this latter case).
Our analysis is motivated by [9], but a few significant differences exist, partly because of
the presence of perturbation terms and the infinite dimensional setting. An interesting byproduct
of our analysis is the derivation, in a systematic way, of the parameters involved
in FISTA, parameters whose source seems to be a mystery. For instance, in all previous
works which discuss accelerated proximal gradient methods, one uses the auxiliary variable
$y_k$ and assumes an explicit linear dependence of it on previous iterations (see (3) above
and the discussion after it). The variable $y_k$ is assumed to depend on positive parameters
$t_k$ which satisfy a certain relation (e.g., (9) below), but no systematic method is
presented which explains the mysterious origin of both $y_k$ and $t_k$: it seems that initially
they were guessed, and in later works they or slight variants of them were used directly
without shedding light on their origin. In our analysis we do not impose in advance any
form on $y_k$ or $t_k$, but rather derive them explicitly during the proof (until late stages in
the proof we only require the existence of $y_k$ satisfying (3), without any relation to $t_k$,
whose existence is not even assumed). After deriving our ideas, we became aware of
the works of Tseng [89, Proof of Proposition 2], [90, Proof of Theorem 1(b)], which also
shed some light on the origin of $t_k$ and $y_k$ (in the exact case). However, his analysis is
different (but not entirely different) from ours.

As said in Subsection 1.1, the superiorization methodology is one of the reasons to
consider the question of perturbation resilience in the context of FISTA. Our final contribution
in this paper is to re-examine this methodology in a comprehensive way and to
significantly extend its scope.

1.3. Paper layout: Basic assumptions and the formulation of inexact versions of FISTA
are given in Section 2. The convergence of the iterative schemes is presented in Section
3, as well as several corollaries and remarks related to the convergence theorem (mainly
regarding the rate of decay of the error terms and the function values), including a comparison
with related papers. The superiorization methodology is re-examined and extended
in Section 4. The proofs of some auxiliary claims are given in the appendix (Section 5).

2. Basic assumptions and the formulation of FISTA with perturbations

2.1. Basic assumptions: From now on $H$ is a given real Hilbert space with an inner
product $\langle\cdot,\cdot\rangle$ and an induced norm $\|\cdot\|$. We define $F : H \to (-\infty, \infty]$ by $F := f + g$,
where $f : H \to \mathbb{R}$ is a given convex function whose derivative $f'$ exists and is Lipschitz
continuous with a Lipschitz constant $L(f') > 0$, i.e., $\|f'(x) - f'(y)\|_* \le L(f')\|x - y\|$ for all
$x, y \in H$, where $\|\cdot\|_*$ is the norm of the dual $H^*$ of $H$. We assume that $g$ is a given
convex and lower semicontinuous function from $H$ to $(-\infty, \infty]$ which is also proper, i.e.,
its effective domain $\operatorname{dom}(g) := \{x \in H : g(x) < \infty\}$ is nonempty.

2.2. The definition of the accelerated scheme: The scheme has two versions: a constant
step size version and a backtracking version. The constant step size version with
perturbations is defined as follows:

Input: a positive number $L \ge L(f')$.

Step 1 (initialization): arbitrary $x_1 \in H$, $y_2 \in H$, $t_2 \ge 1$.

Step $k$, $k \ge 2$: Let $L_k := L$,

$$x_k = p_{L_k}(y_k) + e_k, \tag{5}$$

where $e_k \in H$ is the error term,

$$p_{L_k}(y) := \operatorname{argmin}\{Q_{L_k}(x, y) : x \in H\}, \tag{6}$$

$$Q_{L_k}(x, y) := f(y) + \langle f'(y), x - y\rangle + 0.5L_k\|x - y\|^2 + g(x), \tag{7}$$

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}), \tag{8}$$

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}. \tag{9}$$
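The constant step size scheme can be sketched in code. The following is a minimal Python sketch, not the authors' implementation. It assumes the $\ell_1$-$\ell_2$ setting $f(x) = \|Ax - b\|^2$, $g(x) = \lambda\|x\|_1$, in which $p_L(y)$ is given by componentwise shrinkage, and it reads (8)-(9) as the usual FISTA updates. The error model `err_scale * u / k**3` (with `u` uniform in $[-1, 1]$) is hypothetical and chosen only for illustration; it is not checked against the adaptive condition on the error terms. All function and parameter names are illustrative.

```python
import math
import random

def soft_threshold(v, alpha):
    """Componentwise shrinkage S_alpha(v)_j = max(|v_j| - alpha, 0) * sign(v_j)."""
    return [math.copysign(max(abs(vj) - alpha, 0.0), vj) for vj in v]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def grad_f(A, b, x):
    """Gradient of f(x) = ||Ax - b||^2, namely 2 A^T (Ax - b)."""
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return [2.0 * sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

def objective(A, b, lam, x):
    """F(x) = ||Ax - b||^2 + lam * ||x||_1."""
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return sum(ri * ri for ri in r) + lam * sum(abs(xj) for xj in x)

def inexact_fista(A, b, lam, L, x1, y2, t2=1.0, steps=500, err_scale=0.0, seed=0):
    """Run steps k = 2, ..., steps + 1 and return the last iterate x_k."""
    rng = random.Random(seed)
    x_prev, y, t = list(x1), list(y2), t2
    for k in range(2, steps + 2):
        g = grad_f(A, b, y)
        # p_L(y_k) = S_{lam/L}(y_k - f'(y_k)/L) for this particular f and g
        p = soft_threshold([yj - gj / L for yj, gj in zip(y, g)], lam / L)
        # hypothetical perturbation e_k (illustrative decay only)
        e = [err_scale * rng.uniform(-1.0, 1.0) / k ** 3 for _ in p]
        x = [pj + ej for pj, ej in zip(p, e)]                      # step (5)
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0        # step (9)
        y = [xj + ((t - 1.0) / t_next) * (xj - xpj)                # step (8)
             for xj, xpj in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev

# Small illustrative instance; here L(f') = 2 * lambda_max(A^T A) = 4.5 <= L = 5.
A = [[1.0, 0.5], [0.5, 1.0]]
b = [1.0, -0.5]
x_exact = inexact_fista(A, b, lam=0.1, L=5.0, x1=[0.0, 0.0], y2=[0.0, 0.0])
x_noisy = inexact_fista(A, b, lam=0.1, L=5.0, x1=[0.0, 0.0], y2=[0.0, 0.0], err_scale=0.1)
print(objective(A, b, 0.1, x_exact), objective(A, b, 0.1, x_noisy))
```

In this toy instance the two runs can be compared through their objective values, which is the sense of perturbation resilience discussed here (resilience in the function values).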


The backtracking step size version with perturbations is defined as follows:

Input: $L_1 > 0$, $\eta > 1$.

Step 1 (initialization): arbitrary $x_1 \in H$, $y_2 \in H$, $t_2 \ge 1$.

Step $k$, $k \ge 2$: Find the smallest nonnegative integer $i_k$ such that with $L_k := \eta^{i_k}L_{k-1}$
we have

$$F(p_{L_k}(y_k)) \le Q_{L_k}(p_{L_k}(y_k), y_k). \tag{10}$$

Now let

$$x_k = p_{L_k}(y_k) + e_k, \tag{11}$$

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}), \tag{12}$$

where $t_{k+1}$ is defined in (9). In both versions the error terms are arbitrary vectors in $H$
satisfying a certain adaptivity condition which is presented later (Subsection 2.3 below)
and depends on the boundedness of $F$ on a certain ball with center $x_k$. As is well known,
the minimizer of $x \mapsto Q_{L_k}(x, y_k)$ exists and is unique [6, Corollary 11.15]. Thus $p_{L_k}(y_k)$
and $x_k$ are well defined.
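The search for $L_k$ in the backtracking version can be illustrated as follows. This is a minimal one-dimensional Python sketch under stated assumptions: $g(x) = \lambda|x|$ on the real line (so that $p_L(y)$ is a scalar shrinkage), $f$ and its derivative are passed in as callables, and the names `backtracking_step` and `max_tries` are illustrative rather than taken from the paper.

```python
import math

def soft_threshold(v, alpha):
    """Scalar shrinkage: max(|v| - alpha, 0) * sign(v)."""
    return math.copysign(max(abs(v) - alpha, 0.0), v)

def backtracking_step(f, df, lam, y, L_prev, eta, max_tries=100):
    """Return (L_k, i_k, p_{L_k}(y)) for g(x) = lam * |x| on the real line.

    i_k is the smallest nonnegative integer such that, with
    L_k = eta**i_k * L_prev, the test F(p) <= Q_{L_k}(p, y) holds.
    """
    g = lambda x: lam * abs(x)
    for i in range(max_tries):
        L = (eta ** i) * L_prev
        p = soft_threshold(y - df(y) / L, lam / L)              # p_L(y)
        Q = f(y) + df(y) * (p - y) + 0.5 * L * (p - y) ** 2 + g(p)
        if f(p) + g(p) <= Q:                                    # test (10)
            return L, i, p
    raise RuntimeError("no admissible L found within max_tries")

# Illustrative data: f(x) = (x - 1)^2, so L(f') = 2.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)
L_k, i_k, p = backtracking_step(f, df, lam=0.1, y=3.0, L_prev=0.3, eta=2.0)
print(L_k, i_k, p)
```

For this quadratic $f$ the test fails for every trial value below $L(f') = 2$ (as long as $p \neq y$), so the search stops at the first trial value above it.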


Remark 2.1. We note that the backtracking step size rule is well defined because, according
to a well-known finite dimensional result [67, Lemma 1.2.3, pp. 22-23], whose proof
in the infinite dimensional case is similar (see Lemma 5.1 in the appendix), if $f : H \to \mathbb{R}$
is continuously differentiable with a Lipschitz constant $L(f')$ of $f'$, then

$$f(x) \le f(y) + \langle f'(y), x - y\rangle + 0.5L\|x - y\|^2, \quad \forall x, y \in H,\ \forall L \ge L(f'). \tag{13}$$

By adding $g(x)$ to both sides of (13) and using the representation $F = f + g$ (with
$x := p_L(y_k)$ and $y := y_k$), we conclude that $F(p_L(y_k)) \le Q_L(p_L(y_k), y_k)$. Since for large enough $i_k$ (and, obviously, also
in the constant step size rule) we will have $L_k \ge L(f')$, the above implies
$F(p_{L_k}(y_k)) \le Q_{L_k}(p_{L_k}(y_k), y_k)$. However, it may happen that $F(p_{L_k}(y_k)) \le Q_{L_k}(p_{L_k}(y_k), y_k)$ even when
$L_k < L(f')$.

In addition, the minimization in (6) can be done over the effective domain of $g$. It
follows that $Q_{L_k}(p_{L_k}(y_k), y_k)$ is always finite for all $k \ge 2$. Therefore, if $e_k = 0$, then $F(x_k)$
is finite because in this case the argmin in (6) is attained at $x_k = p_{L_k}(y_k)$ and from the
previous paragraph we have $F(x_k) \le Q_{L_k}(x_k, y_k)$ for all $k \ge 2$.

Remark 2.2. There is a certain delicate point regarding the backtracking step size version:
in many cases computing both sides in $F(p_{L_k}(y_k)) \le Q_{L_k}(p_{L_k}(y_k), y_k)$ is not accurate
because $p_{L_k}(y_k)$ is known only up to an error $e_k$, namely, one actually is able to compute
only $x_k$. Thus, unless we have an exact expression for $p_{L_k}(y_k)$, we actually check whether
$F(x_k) \le Q_{L_k}(x_k, y_k)$. The $L_k$ for which $F(x_k) \le Q_{L_k}(x_k, y_k)$ holds may not satisfy
$F(p_{L_k}(y_k)) \le Q_{L_k}(p_{L_k}(y_k), y_k)$. So we need to find a simple condition which ensures that
if $F(x_k) \le Q_{L'_k}(x_k, y_k)$ holds for some $L'_k$, then $F(p_{L_k}(y_k)) \le Q_{L_k}(p_{L_k}(y_k), y_k)$ for some
explicit positive number $L_k$. In the constant step version there is no problem assuming we
can evaluate $L(f')$ from above, since in this case we can take $L_k$ to be any positive upper
bound on $L(f')$. The problem is with the backtracking step size version, unless $e_k = 0$.

Remark 2.3. The construction of $L_k$ and (13) imply that

$$\rho \le L_k \le \tau, \quad \forall k \ge 1, \tag{14}$$

for some positive numbers $\rho \le \tau$. Indeed, if $L_1 < L(f')$, then (14) holds with $\rho := L_1$
and $\tau := \eta L(f')$. If $L_1 \ge L(f')$, then (14) holds with $\rho := L_1 =: \tau$.


2.3. The condition on the error terms. In the following lines a condition on the error
terms will be presented, namely (17). For a variation of this condition, see Remark 2.6
below. Let $\tilde{x} \in H$ be such that $F(\tilde{x}) < \infty$ (there exists such an $\tilde{x}$ since $F$ is proper) and
let $s_1 > 0$ be fixed (for all $k$). Let $\mu > 0$ be a fixed upper bound on $\|\tilde{x}\|$. Let $(s_k)_{k=2}^{\infty}$ be a
sequence of arbitrary nonnegative numbers. Denote by $B[x_k, 2s_1]$ the closed ball of radius
$2s_1$ and center $x_k$. If $F$ is bounded on $B[x_k, 2s_1]$, then let $m_k$ and $M_k$ be any lower and
upper bounds of $F$ on $B[x_k, 2s_1]$, respectively, satisfying $m_k < M_k$. Define

$$\Lambda_k := \begin{cases} (M_k - m_k)/s_1, & \text{if } F \text{ is bounded on } B[x_k, 2s_1], \\ 0, & \text{otherwise.} \end{cases} \tag{15}$$

For all $k \ge 2$, let

$$\sigma_k := 2t_k^2\left((1/L_k)\Lambda_k + \|p_{L_k}(y_k)\| + \|p_{L_{k-1}}(y_{k-1})\| + 4s_1 + (1/t_k)\mu\right), \tag{16}$$

where $p_{L_1}(y_1) := x_1$. Then for all $k \ge 2$ the error term $e_k$ is any vector in $H$ which satisfies
the following condition:

$$\|e_k\| \le \begin{cases} \min\{s_1, s_k/\sigma_k\}, & \text{if } F \text{ is bounded on } B[x_k, 2s_1], \\ 0, & \text{otherwise.} \end{cases} \tag{17}$$
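To make the bookkeeping in (15)-(17) concrete, here is a small Python sketch. It assumes the reading $\sigma_k = 2t_k^2\big((1/L_k)\Lambda_k + \|p_{L_k}(y_k)\| + \|p_{L_{k-1}}(y_{k-1})\| + 4s_1 + (1/t_k)\mu\big)$ of (16), and it takes the relevant norms and the bounds $m_k$, $M_k$ as given numbers. The function names are illustrative, and the numerical values are made up for the example.

```python
def lambda_k(M_k, m_k, s1, bounded_on_ball=True):
    """Quantity (15): (M_k - m_k) / s1 if F is bounded on B[x_k, 2 s1], else 0."""
    return (M_k - m_k) / s1 if bounded_on_ball else 0.0

def sigma_k(t_k, L_k, Lambda_k, norm_p_k, norm_p_prev, s1, mu):
    """Quantity (16), under the assumed prefactor 2 * t_k**2."""
    inner = Lambda_k / L_k + norm_p_k + norm_p_prev + 4.0 * s1 + mu / t_k
    return 2.0 * t_k ** 2 * inner

def admissible_error_radius(s1, s_k, sigma, bounded_on_ball=True):
    """Right-hand side of (17): the largest allowed value of ||e_k||."""
    if not bounded_on_ball:
        return 0.0
    return min(s1, s_k / sigma)

# Made-up numbers, for illustration only.
Lam = lambda_k(M_k=3.0, m_k=1.0, s1=0.5)
sig = sigma_k(t_k=1.0, L_k=2.0, Lambda_k=Lam, norm_p_k=1.0, norm_p_prev=1.0, s1=0.5, mu=3.0)
print(Lam, sig, admissible_error_radius(0.5, 1.8, sig))
```

Since $\sigma_k$ grows like $t_k^2$, a fixed budget $s_k$ forces the admissible error radius to shrink as the iterations proceed.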


Here are a few remarks regarding Condition (17). 


Remark 2.4. First, if $F$ is bounded on the considered ball, then $g$ must be bounded there
because $f$ is always bounded on balls (this follows from the fact that $f'$ is Lipschitz continuous).
When $H$ is finite dimensional and $g$ is continuous (as happens in many applications: see,
e.g., Section 1), then the boundedness of $g$ is automatically ensured because closed balls
are compact, so classical theorems in analysis can be used. If, however, $g$ attains the
value $\infty$, as happens when $g$ is an indicator function of a closed and convex subset, then
we must require the error term to vanish, whether $H$ is finite or infinite dimensional. In
the infinite dimensional case there is another complication, since then there are exotic
cases [5, Example 7.11, p. 413] in which $g$ may be unbounded on closed balls even if it
is continuous and does not attain the value $\infty$. However, for most applications (e.g., the
infinite dimensional versions of the examples given in Section 1) this does not happen.

Remark 2.5. Condition (17) implies the dependence of $e_k$ on previous iterations. Hence
this condition can be regarded as being adaptive or relative. Conditions in this spirit have
been dealt with in the literature in [22]. In [34,37,48,49] one can find related but more
implicit relations. In other places, e.g., [25,26], there is no such dependence, but rather
the error terms should be summable. In previous works dealing with inexact accelerated
methods in the context of separable functions [33,50,82,92] the error terms are assumed
to decay fast enough to zero by imposing a purely numerical quantity which bounds their
magnitude (or the sum of their magnitudes) from above.

Remark 2.6. It can be argued that (17) is not explicit enough for two reasons. First,
$\tilde{x}$ is sometimes a minimizer (see the formulation of Theorem 3.6 below), so it is not
known, and hence its upper bound $\mu$ is unknown. Second, unless it is known that $e_k = 0$,
we usually cannot compute $p_{L_k}(y_k)$, hence we do not know it, but instead we know $x_k$.
Therefore it is a problem to compute $\sigma_k$ and to estimate $\|e_k\|$.

Here is an answer to the first concern (see the paragraphs below for the second concern).
If $\tilde{x}$ is a minimizer, then it is indeed unknown. However, $\|\tilde{x}\|$ can frequently be estimated.
Assume for instance that $F$ is coercive, i.e., $\lim_{\|x\| \to \infty} F(x) = \infty$. In particular, there is
some $\mu > 0$ such that $F(x) > F(0)$ for all $\|x\| > \mu$. Since $\tilde{x}$ is a minimizer of $F$ we have
$F(\tilde{x}) \le F(0)$, and so it must be that $\|\tilde{x}\| \le \mu$. If for example $F(x) = \|Ax - b\|^2 + \|x\|_1$,
$x \in H := \mathbb{R}^n$, $A : \mathbb{R}^n \to \mathbb{R}^m$ is linear, $m, n \in \mathbb{N}$, $b \in \mathbb{R}^m$, then $F(x) \ge \|x\|_1 \ge \|x\|$ for all
$x \in H$. As a result, we obtain that $F(x) > F(0) = \|b\|^2$ whenever $\|x\| > \mu := \|b\|^2$.
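The explicit bound $\mu := \|b\|^2$ in the last example can be checked numerically. The following Python sketch samples points with Euclidean norm larger than $\|b\|^2$ and verifies that $F(x) > F(0)$ at each of them; the matrix, the vector and the sampling scheme are arbitrary illustrative choices, and the check is of course not a proof.

```python
import math
import random

def F(A, b, x):
    """F(x) = ||Ax - b||^2 + ||x||_1."""
    r = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return sum(ri * ri for ri in r) + sum(abs(xj) for xj in x)

def check_bound(A, b, trials=1000, seed=0):
    """Return True if F(x) > F(0) at every sampled x with ||x|| > ||b||^2."""
    rng = random.Random(seed)
    n = len(A[0])
    mu = sum(bi * bi for bi in b)                 # mu := ||b||^2 = F(0)
    F0 = F(A, b, [0.0] * n)
    for _ in range(trials):
        d = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm_d = math.sqrt(sum(dj * dj for dj in d))
        radius = mu * (1.01 + rng.random())       # a radius strictly larger than mu
        x = [radius * dj / norm_d for dj in d]
        if not F(A, b, x) > F0:
            return False
    return True

A = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
b = [0.5, -1.5]
print(check_bound(A, b))
```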

As for the second concern, we suggest three ways to overcome the problem. First, it
is worth noting that there are important situations in which $p_{L_k}(y_k)$ can be computed
exactly: one of them is the $\ell_1$-$\ell_2$ optimization case, namely, when $H = \mathbb{R}^n$,
$f(x) = \|Ax - b\|^2$, $A : \mathbb{R}^n \to \mathbb{R}^m$ is linear with adjoint $A^* : \mathbb{R}^m \to \mathbb{R}^n$, $b \in \mathbb{R}^m$, $g(x) = \lambda\|x\|_1$,
since then $Q_{L_k}(x, y_k) = f(y_k) + \langle f'(y_k), x - y_k\rangle + 0.5L_k\|x - y_k\|^2 + \lambda\|x\|_1$. In this case
one has $p_{L_k}(y_k) = S_{\lambda/L_k}(y_k - (2/L_k)A^*(Ay_k - b))$, where $S_\alpha : \mathbb{R}^n \to \mathbb{R}^n$ is the shrinkage
operator which maps the vector $x$ to the vector $S_\alpha(x) = (\max\{|x_j| - \alpha, 0\}\operatorname{sign}(x_j))_{j=1}^{n}$ for
each given $\alpha > 0$. See, for instance, [9, pp. 185,188], [101, p. 80, Equation (21)]. In
such cases it is also possible to use the perturbations in an active way, e.g., as a means
for enhancing the speed of convergence or for achieving other purposes, as done in the
superiorization scheme (see Section 4 below).
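The closed form for $p_{L_k}(y_k)$ in the $\ell_1$-$\ell_2$ case translates directly into code. The Python sketch below implements the shrinkage operator and the resulting step, and evaluates $Q_L(\cdot, y)$ so that the minimizing property can be spot-checked on random perturbations. The data are arbitrary illustrative values and the function names are not from the paper.

```python
import math
import random

def soft_threshold(v, alpha):
    """S_alpha(v)_j = max(|v_j| - alpha, 0) * sign(v_j)."""
    return [math.copysign(max(abs(vj) - alpha, 0.0), vj) for vj in v]

def residual(A, b, x):
    return [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]

def grad_f(A, b, x):
    """f'(x) = 2 A^T (Ax - b) for f(x) = ||Ax - b||^2."""
    r = residual(A, b, x)
    return [2.0 * sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

def prox_step(A, b, lam, L, y):
    """p_L(y) = S_{lam/L}(y - (2/L) A^T (Ay - b))."""
    g = grad_f(A, b, y)
    return soft_threshold([yj - gj / L for yj, gj in zip(y, g)], lam / L)

def Q(A, b, lam, L, x, y):
    """Q_L(x, y) = f(y) + <f'(y), x - y> + 0.5 L ||x - y||^2 + lam ||x||_1."""
    fy = sum(ri * ri for ri in residual(A, b, y))
    g = grad_f(A, b, y)
    lin = sum(gj * (xj - yj) for gj, xj, yj in zip(g, x, y))
    quad = 0.5 * L * sum((xj - yj) ** 2 for xj, yj in zip(x, y))
    return fy + lin + quad + lam * sum(abs(xj) for xj in x)

A = [[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]]
b = [1.0, 0.0, 2.0]
y = [0.3, -0.7]
p = prox_step(A, b, lam=0.5, L=3.0, y=y)
print(p, Q(A, b, 0.5, 3.0, p, y))
```

Note that $p_L(y)$ minimizes $Q_L(\cdot, y)$ for every $L > 0$; the requirement $L \ge L(f')$ is only needed for $Q_L$ to be an upper bound on $F$.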

A second way to overcome the second concern can be used in the frequent case where
$p_{L_k}(y_k)$ can be computed only approximately. In this case, given $k \ge 2$ we can replace
(17) by the following condition:

$$\|e_k\| \le \begin{cases} \min\{s_1, s_k/\sigma'_k\}, & \text{if } F \text{ is bounded on } B[x_k, 2s_1], \\ 0, & \text{otherwise,} \end{cases} \tag{18}$$

where

$$\sigma'_k := 2t_k^2\left((1/L_k)\Lambda_k + \|x_k - ((t_k - 1)/t_k)x_{k-1}\| + 2s_1 + (1/t_k)\mu\right), \tag{19}$$

and our convergence results still remain correct due to (50) below. For applying (18)
in practice we first approximate $p_{L_k}(y_k)$ up to some arbitrarily small parameter $\epsilon > 0$.
Some examples are mentioned in [92] (this follows from [80, Proposition 2.5] and the
discussion after [80, Definition 2.1]); see also some of the examples in [9,50]. What we
obtain is a point $x_k$ for which we know that $\|x_k - p_{L_k}(y_k)\| \le \epsilon$. Now we check whether
$\epsilon \le \min\{s_1, s_k/\sigma'_k\}$. If yes, then for sure $e_k := x_k - p_{L_k}(y_k)$ satisfies (18). Otherwise, we
continue to approximate $p_{L_k}(y_k)$ using a smaller parameter, say $0.5\epsilon$, and calling it again
$\epsilon$ (of course, $k$ is fixed during this process). Eventually the inequality $\epsilon \le \min\{s_1, s_k/\sigma'_k\}$
will be satisfied since $\sigma'_k$ is bounded from below by a positive number.
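The halving procedure just described can be written as a short loop. The Python sketch below is schematic: `approx_prox(eps)` stands for any routine (hypothetical here) that returns a point within distance `eps` of $p_{L_k}(y_k)$, and `radius_fn(x_k)` stands for the right-hand side of (18) evaluated at the computed point. Both callables, and the toy choices used to exercise the loop, are assumptions made for illustration.

```python
def refine_until_admissible(approx_prox, radius_fn, eps0, max_halvings=60):
    """Halve eps until the accuracy eps is within the admissible error radius.

    approx_prox(eps) -> x_k with ||x_k - p|| <= eps   (hypothetical oracle)
    radius_fn(x_k)   -> min{s_1, s_k / sigma'_k} computed from x_k
    """
    eps = eps0
    for _ in range(max_halvings):
        x_k = approx_prox(eps)
        if eps <= radius_fn(x_k):
            return x_k, eps
        eps *= 0.5
    raise RuntimeError("accuracy target not reached within max_halvings")

# Toy one-dimensional stand-ins: the exact point is 1.0 and the oracle
# returns a point at distance 0.9 * eps from it.
approx_prox = lambda eps: 1.0 + 0.9 * eps
radius_fn = lambda x: min(0.5, 0.2 / (2.0 * (abs(x) + 1.0)))
x_k, eps = refine_until_admissible(approx_prox, radius_fn, eps0=1.0)
print(x_k, eps)
```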
The third way to overcome the second concern is to bound $\sigma_k$ from above by some
explicit parameter $\hat{\sigma}_k$. Then we can take any vector $e_k \in H$ satisfying

$$\|e_k\| \le \begin{cases} \min\{s_1, s_k/\hat{\sigma}_k\}, & \text{if } F \text{ is bounded on } B[x_k, 2s_1], \\ 0, & \text{otherwise.} \end{cases} \tag{20}$$

Such $e_k$ will satisfy (17) too. It remains to estimate $\sigma_k$ from above. As follows from [6,
Proposition 11.13, p. 158], the function $u : H \to (-\infty, \infty]$ defined by $u(x) := Q_{L_k}(x, y_k)$
for each $x \in H$ is supercoercive (i.e., $\lim_{\|x\| \to \infty} u(x)/\|x\| = \infty$) because it is a sum of a
quadratic (hence supercoercive) function and a convex, proper and lower semicontinuous
function. Consequently there exists $\nu_k$ large enough such that for all $x$ satisfying $\|x\| > \nu_k$
we have, in particular, that $u(x) > u(0)$. Since $p_{L_k}(y_k)$ is the minimizer of $u$ we conclude
that $u(x) > u(0) \ge u(p_{L_k}(y_k))$ for each $x$ which satisfies $\|x\| > \nu_k$. Therefore $\|p_{L_k}(y_k)\| \le \nu_k$.
Thus $\sigma_k$ is bounded from above by

$$\hat{\sigma}_k := 2t_k^2\left((1/L_k)\Lambda_k + \nu_k + \nu_{k-1} + 4s_1 + (1/t_k)\mu\right). \tag{21}$$


3. The convergence theorem 


The proof of the main convergence theorem (Theorem 3.6 below) is based on several
lemmas. The first one is a generalization of [9, Lemma 2.3] to the case where $H$ is
infinite dimensional and $g$ is lower semicontinuous. A large part of the proof is similar
to that of [9, Lemma 2.3] and hence we decided to put it in the appendix.

Lemma 3.1. Suppose that $y \in H$ and $L > 0$ satisfy $F(p_L(y)) \le Q_L(p_L(y), y)$, where $Q_L$
is defined in (7) with $L$ instead of $L_k$. Then for all $x \in H$,

$$F(x) - F(p_L(y)) \ge 0.5L\|p_L(y) - y\|^2 + L\langle p_L(y) - y, y - x\rangle. \tag{22}$$

Remark 3.2. The definition of $L_k$ and Remark 2.1 imply that we can use Lemma 3.1
with $y = y_k$, $L = L_k$, and an arbitrary $x \in H$.


The next lemma is perhaps known and its proof is given for the sake of completeness. 


Lemma 3.3. Let $(X, \|\cdot\|)$ be a real normed space and let $G : X \to (-\infty, \infty]$ be convex.
Let $B \subset X$ be a closed ball with radius $r_B$ and center $a \in X$, and let $B'$ be any closed
ball containing $B$ with the same center $a$ and with a radius $r_{B'} > r_B$. Suppose that there
exist real numbers $m_{B'} < M_{B'}$ such that $m_{B'} \le G(x) \le M_{B'}$ for all $x \in B'$. Then $G$ is
Lipschitz on $B$ with a Lipschitz constant

$$\Lambda := (M_{B'} - m_{B'})/(r_{B'} - r_B). \tag{23}$$

Proof. The proof is closely related to the proof of [91, Theorem 2.21, p. 69]. Let $x, y \in B$
be arbitrary. Denote $r := r_B$, $r' := r_{B'} > r$. Let $z := y + ((r' - r)/\|y - x\|)(y - x)$ if $x \ne y$
and $z := y = x$ otherwise. Then $y = \lambda z + (1 - \lambda)x$, where $\lambda := \|x - y\|/(\|x - y\| + r' - r)$.
Since $\lambda \in [0, 1]$, the convexity of $G$ implies that $G(y) \le \lambda G(z) + (1 - \lambda)G(x)$. Since $y \in B$,
the definition of $z$ implies that $\|z - a\| \le \|y - a\| + r' - r \le r'$. Therefore $z \in B'$. The
above inequalities, the definition of $\lambda$, and the fact that $G(x), G(y) \in \mathbb{R}$ (since $x, y \in B'$)
imply the inequality

$$G(y) - G(x) \le \lambda(G(z) - G(x)) \le \lambda(M_{B'} - m_{B'}) \le \frac{(M_{B'} - m_{B'})\|x - y\|}{r_{B'} - r_B}. \tag{24}$$

By interchanging the role of $x$ and $y$ we obtain

$$G(x) - G(y) \le \frac{(M_{B'} - m_{B'})\|y - x\|}{r_{B'} - r_B}. \tag{25}$$

Since $x$ and $y$ were arbitrary points in $B$, it follows that $G$ is Lipschitz on $B$ with a
Lipschitz constant given in (23), as claimed. □
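As a quick sanity check of the constant in (23), consider the real line with $G(x) = x^2$, $B = [-1, 1]$ and $B' = [-2, 2]$, so that $m_{B'} = 0$, $M_{B'} = 4$ and the lemma gives the constant $4/(2 - 1) = 4$. The Python sketch below compares this with the largest difference quotient of $G$ over a grid in $B$; the example function and the grid size are illustrative choices.

```python
def lipschitz_bound(M, m, r_outer, r_inner):
    """Constant (23): (M_{B'} - m_{B'}) / (r_{B'} - r_B)."""
    return (M - m) / (r_outer - r_inner)

def max_slope(G, center, radius, n=200):
    """Largest difference quotient of G over a uniform grid in [center - radius, center + radius]."""
    pts = [center - radius + 2.0 * radius * i / n for i in range(n + 1)]
    return max(abs(G(x) - G(y)) / abs(x - y) for x in pts for y in pts if x != y)

G = lambda x: x * x
bound = lipschitz_bound(M=4.0, m=0.0, r_outer=2.0, r_inner=1.0)
observed = max_slope(G, center=0.0, radius=1.0)
print(bound, observed)
```

The bound is not sharp in this example (the best Lipschitz constant of $x^2$ on $[-1, 1]$ is 2), which is consistent with the lemma giving only a sufficient constant.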


Remark 3.4. As shown in [5, Proposition 7.8] for the case where $X$ is Hilbert, boundedness
of $G$ on balls (hence on bounded subsets) is equivalent to $G$ being Lipschitz on
bounded subsets and also equivalent to the existence and uniform boundedness of the
subgradients of $G$ on bounded subsets. In the finite dimensional case all of these conditions
always hold as a corollary of [91, Theorem 5.23, p. 70], but the counterexample
given in [5, Example 7.11, p. 413] shows that in the infinite dimensional case they do not
necessarily hold.


In order to formulate Theorem 3.6 below, we need the following definition.

Definition 3.5. $F$ is said to be double bounded if it is bounded on bounded subsets of $H$.

Theorem 3.6. In the framework of Section 2, suppose that one of the following two
possibilities holds: either we are in the backtracking step size rule, and then the optimal
set of $F$ is nonempty and we fix an arbitrary minimizer $\tilde{x}$ in the optimal set, or we are in
the constant step size rule, and then we fix an arbitrary $\tilde{x} \in H$ for which $F(\tilde{x})$ is finite.
Then for all $k \ge 1$,

$$F(x_{k+1}) - F(\tilde{x}) \le \frac{2\tau\left((2/L_1)t_1(t_1 - 1)(F(x_1) - F(\tilde{x})) + \|t_1y_2 - (t_1 - 1)x_1 - \tilde{x}\|^2 + \sum_{j=2}^{k+1}s_j\right)}{(k+1)^2}. \tag{26}$$

If, in addition,

$$\lim_{k \to \infty}\frac{\sum_{j=2}^{k+1}s_j}{(k+1)^2} = 0, \tag{27}$$

then $\lim_{k \to \infty}F(x_k) = \inf F$. In particular, the above holds when $F$ is double bounded
and also when $F$ is not double bounded but $e_k = 0$ for all $k \ge 2$.
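The role of the sequence $(s_j)$ in (26)-(27) can be illustrated numerically. The Python sketch below evaluates the ratio $\sum_{j=2}^{k+1}s_j/(k+1)^2$ for three illustrative choices of $s_j$ (these choices are assumptions made for the example, not taken from the paper): a summable sequence, for which the ratio decays like $1/k^2$; a constant sequence, for which it decays like $1/k$; and a quadratically growing one, for which it does not tend to zero.

```python
def error_budget_ratio(s, k):
    """Return (sum_{j=2}^{k+1} s(j)) / (k + 1)^2, the quantity appearing in (27)."""
    return sum(s(j) for j in range(2, k + 2)) / (k + 1) ** 2

summable = lambda j: 1.0 / j ** 2      # sum stays bounded: ratio = O(1/k^2)
constant = lambda j: 1.0               # sum grows like k:  ratio = O(1/k)
growing = lambda j: float(j ** 2)      # sum grows like k^3: ratio does not vanish

for k in (9, 99, 999):
    print(k, error_budget_ratio(summable, k),
          error_budget_ratio(constant, k),
          error_budget_ratio(growing, k))
```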


Proof. During the proof all the relevant expressions will be derived. In particular, there
will be no use of the specific form of $y_k$ until (46) (only the existence of $y_k \in H$ which
satisfies (5) or (11) will be assumed) and no use of the specific form of $t_{k+1}$ until (48)
(the existence of $t_k$ will not even be assumed until (48)). For the sake of convenience,
the proof is divided into several steps.


Step 1 : Fix an arbitrary k > 1 . Let Bk+i := B[xk+i,Si], := B[xk+i,2si\ and 

Vk ■= F{xk) — F{x). Either F is bounded on and then F{xk) is hnite, or F is not 
bounded there and then e^, = 0 from (17) (or (18)). This, together with Remark 2.1, 
implies that if in addition k > 2, then also in this case F{xk) is finite. Since we always 
assume that F{x) is finite it follows that Vk+i is finite for all k E N. 

From now until the last paragraph of this step (excluding) assume that F is bounded 
on At the end of the step we will deal with the second possibility. Let mk+i and 

Mfc+i, mk+i < Mfc+i be any lower and upper bounds of F on respectively, and let 
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Afc+i := (Mfc_|_i — mk+i)/si. Since ||efc+i|| < Si < 2si and since Lemma 3.3 implies that 
F is Lipschitz on with a Lipschitz constant A^+i, we have 

\[
\Lambda_{k+1}\|e_{k+1}\| - F(x_{k+1}) \ge -F(x_{k+1} - e_{k+1}). \tag{28}
\]

Thus, by substituting $x = x_k$, $y = y_{k+1}$, $L = L_{k+1}$ in (22), using the definition of $v_k$,
using (11) (or (5)), Lemma 3.1, and using (28), we obtain

\[
\begin{aligned}
2(v_k + \Lambda_{k+1}\|e_{k+1}\| - v_{k+1})/L_{k+1} &\ge 2(F(x_k) - F(x_{k+1} - e_{k+1}))/L_{k+1} \\
&\ge \|x_{k+1} - e_{k+1} - y_{k+1}\|^2 + 2\langle x_{k+1} - e_{k+1} - y_{k+1}, y_{k+1} - x_k\rangle.
\end{aligned} \tag{29}
\]

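By substituting $x = \tilde{x}$, $y = y_{k+1}$, $L = L_{k+1}$ in (22) and using (11) (or (5)), Lemma 3.1
and (28), we obtain

\[
\begin{aligned}
2(\Lambda_{k+1}\|e_{k+1}\| - v_{k+1})/L_{k+1} &\ge 2(F(\tilde{x}) - F(x_{k+1} - e_{k+1}))/L_{k+1} \\
&\ge \|x_{k+1} - e_{k+1} - y_{k+1}\|^2 + 2\langle x_{k+1} - e_{k+1} - y_{k+1}, y_{k+1} - \tilde{x}\rangle.
\end{aligned} \tag{30}
\]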

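Now we multiply (29) by a nonnegative number $\gamma_k$ (to be determined later) and add the
resulting inequality to (30). We have

\[
\begin{aligned}
&(2/L_{k+1})(\gamma_kv_k - (1+\gamma_k)v_{k+1}) + (2/L_{k+1})(1+\gamma_k)\Lambda_{k+1}\|e_{k+1}\| \\
&\quad\ge (\gamma_k+1)\|x_{k+1} - e_{k+1} - y_{k+1}\|^2 + 2\langle x_{k+1} - e_{k+1} - y_{k+1}, \gamma_k(y_{k+1} - x_k) + y_{k+1} - \tilde{x}\rangle \\
&\quad= (\gamma_k+1)\|x_{k+1} - y_{k+1}\|^2 + (\gamma_k+1)\|e_{k+1}\|^2 - 2\langle e_{k+1}, (\gamma_k+1)(x_{k+1} - y_{k+1})\rangle \\
&\qquad + 2\langle x_{k+1} - y_{k+1}, \gamma_k(y_{k+1} - x_k) + y_{k+1} - \tilde{x}\rangle - 2\langle e_{k+1}, \gamma_k(y_{k+1} - x_k) + y_{k+1} - \tilde{x}\rangle.
\end{aligned} \tag{31}
\]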

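Now we multiply (31) by some nonnegative number $\delta_k$ (to be determined later). We have

\[
\begin{aligned}
&(2/L_{k+1})(\delta_k\gamma_kv_k - \delta_k(1+\gamma_k)v_{k+1}) + (2/L_{k+1})\delta_k(1+\gamma_k)\Lambda_{k+1}\|e_{k+1}\| \\
&\quad\ge \|(\delta_k(1+\gamma_k))^{0.5}(x_{k+1} - y_{k+1})\|^2 + 2\delta_k\langle x_{k+1} - y_{k+1}, (1+\gamma_k)y_{k+1} - (\gamma_kx_k + \tilde{x})\rangle \\
&\qquad + \delta_k(1+\gamma_k)\|e_{k+1}\|^2 - 2\delta_k\langle e_{k+1}, (1+\gamma_k)x_{k+1} - (\gamma_kx_k + \tilde{x})\rangle.
\end{aligned} \tag{32}
\]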

So far we assumed that $F$ is bounded on $B[x_{k+1}, 2s_1]$. However, if it is not bounded there,
then according to (17) or (18) we have $e_{k+1} = 0$, and then (28)-(32) still hold (with
arbitrary $\Lambda_{k+1} \in \mathbb{R}$), again because of Lemma 3.1 and the same simple algebra. The above
inequalities also hold, trivially, when $v_k = \infty$ (can happen only when $k = 1$).

Step 2: In order to reach useful expressions, we want to use the simple vectorial identity

\[
\|b - a\|^2 + 2\langle b - a, a - c\rangle = \|b - c\|^2 - \|a - c\|^2, \tag{33}
\]

which seems related to the right hand side of (32) (if we ignore for a moment the terms
involving perturbations). In order to use it, we impose additional assumptions on the
sequences $(\gamma_k)_{k=1}^{\infty}$ and $(\delta_k)_{k=1}^{\infty}$ (in addition to non-negativity):

\[
1 + \gamma_k = (\delta_k(1+\gamma_k))^{0.5} = \delta_k, \quad \forall k \ge 1. \tag{34}
\]

Fortunately, these three equations are consistent and once we assume (34), substitute

\[
a = \delta_ky_{k+1}, \quad b = \delta_kx_{k+1}, \quad c = \gamma_kx_k + \tilde{x} \tag{35}
\]

in (33), use the Cauchy-Schwarz inequality, and use (32), we obtain

\[
\begin{aligned}
&(2/L_{k+1})(\delta_k(\delta_k-1)v_k - \delta_k^2v_{k+1}) - \delta_k^2\|e_{k+1}\|^2 \\
&\quad + (2/L_{k+1})\delta_k^2\Lambda_{k+1}\|e_{k+1}\| + 2\delta_k^2\|e_{k+1}\|\,\|x_{k+1} - ((\delta_k-1)/\delta_k)x_k - \tilde{x}/\delta_k\| \\
&\quad\ge \|\delta_kx_{k+1} - (\gamma_kx_k + \tilde{x})\|^2 - \|\delta_ky_{k+1} - (\gamma_kx_k + \tilde{x})\|^2.
\end{aligned} \tag{36}
\]

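With the notation

\[
\epsilon_{k+1} := 2\delta_k^2\|e_{k+1}\|\left((\Lambda_{k+1}/L_{k+1}) + \|x_{k+1} - ((\delta_k-1)/\delta_k)x_k - \tilde{x}/\delta_k\|\right) \tag{37}
\]

and the fact that $-\delta_k^2\|e_{k+1}\|^2 \le 0$ we obtain the inequality

\[
(2/L_{k+1})(\delta_k(\delta_k-1)v_k - \delta_k^2v_{k+1}) + \epsilon_{k+1} \ge \|\delta_kx_{k+1} - (\gamma_kx_k + \tilde{x})\|^2 - \|\delta_ky_{k+1} - (\gamma_kx_k + \tilde{x})\|^2. \tag{38}
\]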

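Now there are two possibilities: if we are in the constant step size rule, then $L_{k+1} = L_k$
and we obtain from (38) that

\[
(2/L_k)\delta_k(\delta_k-1)v_k - (2/L_{k+1})\delta_k^2v_{k+1} + \epsilon_{k+1} \ge \|\delta_kx_{k+1} - (\gamma_kx_k + \tilde{x})\|^2 - \|\delta_ky_{k+1} - (\gamma_kx_k + \tilde{x})\|^2. \tag{39}
\]

If we are in the backtracking step size rule, then $F(x_k) \ge F(\tilde{x})$ and hence $v_k \ge 0$. Since
also $\delta_k - 1 = \gamma_k \ge 0$ and $L_{k+1} \ge L_k$, we obtain (39) again from (38).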


Step 3: We want to represent the non perturbed term in the left hand side of (39) as

\[
a_k - a_{k+1} \tag{40}
\]

for some sequence of positive numbers $(a_k)_{k=1}^{\infty}$, and to represent the right hand side of
(39) as

\[
\|w_{k+1}\|^2 - \|w_k\|^2 \tag{41}
\]

for a sequence of vectors $(w_k)_{k=1}^{\infty}$. The reason for doing this will become clear later (see
(51) and the discussion after it). For obtaining (40) we impose the condition

\[
\delta_{k+1}(\delta_{k+1} - 1) = \delta_k^2, \quad \forall k \ge 1. \tag{42}
\]

It leads to (40) with

\[
a_k := (2/L_k)\delta_k(\delta_k - 1)v_k, \quad \forall k \ge 1. \tag{43}
\]


Step 4: For obtaining (41) we impose some conditions on the sequence $(y_k)_{k=2}^{\infty}$ (so far
we only assumed the existence of $y_k \in H$ satisfying (5) or (11) but not its form). The
condition is that with

\[
w_k := \delta_ky_{k+1} - (\gamma_kx_k + \tilde{x}), \quad \forall k \ge 1, \tag{44}
\]

we will have

\[
w_{k+1} = \delta_kx_{k+1} - (\gamma_kx_k + \tilde{x}), \quad \forall k \ge 1. \tag{45}
\]

Thus, from (34), (44), (45),

\[
\begin{aligned}
y_{k+2} &= \frac{w_{k+1} + (\gamma_{k+1}x_{k+1} + \tilde{x})}{\delta_{k+1}}
= \frac{(\delta_kx_{k+1} - (\gamma_kx_k + \tilde{x})) + \gamma_{k+1}x_{k+1} + \tilde{x}}{\delta_{k+1}} \\
&= \frac{(\delta_{k+1} + \delta_k - 1)x_{k+1} - (\delta_k - 1)x_k}{\delta_{k+1}}, \quad \forall k \ge 1.
\end{aligned} \tag{46}
\]

Step 5: We still need to find $\delta_k$ and $\gamma_k$. After solving the quadratic equation (42) for
$\delta_{k+1}$ and taking into account the assumption $\delta_k \ge 0$ for all $k \ge 1$, we obtain

\[
\delta_{k+1} = \frac{1 + \sqrt{1 + 4\delta_k^2}}{2}, \quad \forall k \ge 1. \tag{47}
\]

The only restriction on $\delta_1$ is that $\delta_1 \ge 1$ so that $\gamma_1 \ge 0$ because of (34). There is no
restriction on $y_2$. Once we choose $y_2$ and $\delta_1$ we obtain $\gamma_k$ from (34) and see that indeed
$\gamma_k \ge 0$ and $\delta_k \ge 1$ for all $k$. The equalities and inequalities mentioned earlier indeed hold
from the construction of $\delta_k$. By denoting

\[
t_k := \delta_{k-1}, \quad \forall k \ge 2, \tag{48}
\]

we derive the expression mentioned in (9). From (46) we derive the specific form (8)
(and (12)) of $y_{k+1}$.
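
The recursion obtained in Step 5 can be checked numerically. The following is a minimal sketch, assuming the reading $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$ with $t_2 \ge 1$ suggested by (47) and (48); the function name `step_parameters` is hypothetical and is not taken from the paper. It verifies the quadratic condition (42) and the linear growth $t_k \ge 0.5k$ that is used later in the proof.

```python
import math

def step_parameters(t2, count):
    """Return [t_2, t_3, ..., t_{count+1}] from t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    if t2 < 1:
        raise ValueError("the derivation requires t_2 = delta_1 >= 1")
    ts = [float(t2)]
    for _ in range(count - 1):
        ts.append((1.0 + math.sqrt(1.0 + 4.0 * ts[-1] ** 2)) / 2.0)
    return ts

ts = step_parameters(1.0, 50)
for i in range(len(ts) - 1):
    t_k, t_next = ts[i], ts[i + 1]
    # Quadratic condition (42), rewritten with t_{k+1} = delta_k.
    assert math.isclose(t_next * (t_next - 1.0), t_k ** 2, rel_tol=1e-9)
# Linear growth: ts[i] stores t_{i+2}, and t_k >= 0.5 * k for every k >= 2.
assert all(t >= 0.5 * (i + 2) for i, t in enumerate(ts))
```

The growth check works because $\sqrt{1 + 4t_k^2} \ge 2t_k$ gives $t_{k+1} \ge t_k + 0.5$.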


Step 6: Now, by induction we obtain from (37) and (39)-(44) that

\[
a_k + \|w_k\|^2 + \epsilon_{k+1} \ge a_{k+1} + \|w_{k+1}\|^2, \quad \forall k \ge 1. \tag{49}
\]

This implies that the sequence $(a_k + \|w_k\|^2)_{k=1}^{\infty}$ of real numbers is decreasing up to a small
perturbation. From the inequality $\|e_{k+1}\| \le s_1$, (11) (or (5)), the triangle inequality,
(17), the assumption that $\|\tilde{x}\| \le \mu$, (37), and (48) it follows that for all $k \ge 1$

\[
\begin{aligned}
\epsilon_{k+1} &= 2t_{k+1}^2\|e_{k+1}\|\left((\Lambda_{k+1}/L_{k+1}) + \|x_{k+1} - ((t_{k+1}-1)/t_{k+1})x_k - (1/t_{k+1})\tilde{x}\|\right) \\
&\le 2t_{k+1}^2\|e_{k+1}\|\left((\Lambda_{k+1}/L_{k+1}) + \|x_{k+1} - ((t_{k+1}-1)/t_{k+1})x_k\| + (1/t_{k+1})\mu + 2s_1\right) \\
&\le 2t_{k+1}^2\|e_{k+1}\|\left((\Lambda_{k+1}/L_{k+1}) + \|x_{k+1}\| + \|x_k\| + 2s_1 + (1/t_{k+1})\mu\right) \\
&\le 2t_{k+1}^2\|e_{k+1}\|\left((\Lambda_{k+1}/L_{k+1}) + \|p_{L_{k+1}}(y_{k+1})\| + s_1 + \|p_{L_k}(y_k)\| + s_1 + 2s_1 + (1/t_{k+1})\mu\right) \\
&= \|e_{k+1}\|\sigma_{k+1} \le s_{k+1}.
\end{aligned} \tag{50}
\]

If (18) holds instead of (17), then similar considerations show that $\epsilon_{k+1} \le s_{k+1}$ (the
final estimate in (50) is replaced by the corresponding estimate based on (18)). Therefore, using (49),

\[
a_1 + \|w_1\|^2 + \sum_{j=2}^{k+1}s_j \ge a_1 + \|w_1\|^2 + \sum_{j=1}^{k}\epsilon_{j+1} \ge a_{k+1} + \|w_{k+1}\|^2 \ge a_{k+1}, \quad \forall k \ge 1. \tag{51}
\]

The above implies, using (43), that for all $k \ge 1$

\[
a_1 + \|w_1\|^2 + \sum_{j=2}^{k+1}s_j \ge a_{k+1} = 2t_{k+2}(t_{k+2} - 1)(F(x_{k+1}) - F(\tilde{x}))/L_{k+1}. \tag{52}
\]

From (42), (48), and (52) it follows that for all $k \ge 1$

\[
F(x_{k+1}) - F(\tilde{x}) \le \frac{L_{k+1}\left((2/L_1)t_2(t_2-1)(F(x_1) - F(\tilde{x})) + \|t_2y_2 - ((t_2-1)x_1 + \tilde{x})\|^2 + \sum_{j=2}^{k+1}s_j\right)}{2t_{k+1}^2}. \tag{53}
\]

Step 7: From (47), (48), and simple induction it follows that

\[
t_{k+1} = \delta_k \ge 0.5(k+1), \quad \forall k \ge 1. \tag{54}
\]

This inequality, (53) and (14) yield

\[
\begin{aligned}
F(x_{k+1}) - F(\tilde{x}) &\le \frac{L_{k+1}\left(a_1 + \|w_1\|^2 + \sum_{j=2}^{k+1}s_j\right)}{2t_{k+1}^2} \\
&\le \frac{2\tau\left((2/L_1)t_2(t_2-1)(F(x_1) - F(\tilde{x})) + \|t_2y_2 - ((t_2-1)x_1 + \tilde{x})\|^2 + \sum_{j=2}^{k+1}s_j\right)}{(k+1)^2}.
\end{aligned} \tag{55}
\]


Step 8: It remains to show that under the assumption (27) we have $\lim_{k\to\infty}F(x_k) = \inf_H F$.
Recall again that either we are in the backtracking step size rule and then $\tilde{x}$ is
a minimizer of $F$ or we are in the constant step rule and then $\tilde{x}$ is arbitrary. In the first
case (55) and the inequality $\min F = F(\tilde{x}) \le F(x_k)$ for all $k$ imply the assertion. In the
second case we conclude from (55) that for all $\epsilon > 0$ and for all $k$ sufficiently large

\[
F(x_k) < F(\tilde{x}) + \epsilon. \tag{56}
\]

Now there are two possibilities: if $\inf_H F = -\infty$, then (56), combined with the fact that
$\tilde{x}$ was an arbitrary point in $H$, implies that $\lim_{k\to\infty}F(x_k) = -\infty = \inf_H F$, as claimed.
Otherwise, we can take $\tilde{x} \in H$ such that $F(\tilde{x}) < \inf_H F + \epsilon$ and we conclude that
$F(x_k) < \inf_H F + 2\epsilon$ for all $\epsilon > 0$ and all $k$ sufficiently large. This and the inequality
$\inf_H F \le F(x_k)$ for all $k$ imply that $\lim_{k\to\infty}F(x_k) = \inf_H F$. □
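
The scheme analyzed in the proof can be illustrated on a small example. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses the constant step size rule as it can be read off from (46)-(48), namely $x_k = p_L(y_k) + e_k$, $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$, $y_{k+1} = x_k + ((t_k - 1)/t_{k+1})(x_k - x_{k-1})$, with $t_2 = 1$ and $y_2 = x_1$. The test problem (a diagonal quadratic plus an $\ell_1$ term), the data `D`, `B`, `LAM`, and the error model $\|e_k\| = \text{error\_scale}/k^{\omega}$ are all hypothetical choices made for illustration.

```python
import math

D = [1.0, 0.1, 0.01]      # hypothetical diagonal of the quadratic term f
B = [3.0, -2.0, 1.0]      # hypothetical data vector
LAM = 0.05                # hypothetical weight of the l1 term g
L = max(D)                # Lipschitz constant of f' for this diagonal f

def F(x):
    smooth = 0.5 * sum(d * (xi - bi) ** 2 for d, xi, bi in zip(D, x, B))
    return smooth + LAM * sum(abs(xi) for xi in x)

def soft(v, thr):
    """Soft thresholding: the proximal map of thr * |.|."""
    return math.copysign(max(abs(v) - thr, 0.0), v)

def prox_step(y):
    """p_L(y): a gradient step on f followed by the proximal map of g / L."""
    return [soft(yi - d * (yi - bi) / L, LAM / L) for d, yi, bi in zip(D, y, B)]

# The problem is separable, so the minimizer is known in closed form.
X_STAR = [soft(bi, LAM / d) for d, bi in zip(D, B)]
F_STAR = F(X_STAR)

def inexact_fista(iterations, error_scale, omega):
    """Return the gaps F(x_k) - F(x*) for k = 2, ..., iterations + 1."""
    x_prev = [0.0, 0.0, 0.0]    # x_1
    y = list(x_prev)            # y_2
    t = 1.0                     # t_2
    gaps = []
    for k in range(2, iterations + 2):
        size = error_scale / k ** omega / math.sqrt(3.0)
        sign = 1.0 if k % 2 == 0 else -1.0
        e = [sign * size] * 3   # error vector with norm error_scale / k^omega
        x = [p + ei for p, ei in zip(prox_step(y), e)]
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [xi + (t - 1.0) / t_next * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev, t = x, t_next
        gaps.append(F(x) - F_STAR)
    return gaps

exact_gaps = inexact_fista(500, 0.0, 4.0)
inexact_gaps = inexact_fista(500, 1.0, 4.0)
```

With errors decaying like $1/k^4$, both runs end with a small gap in the function values, in line with the $O(1/k^2)$ behaviour that the theorem predicts for fast enough decay of the error terms. The sketch does not implement conditions (17) or (18) literally, since these are stated in Section 2.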


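Corollary 3.7. Under the setting of Theorem 3.6, if there exists a real number $r$ such
that $s_k = O(1/k^r)$ for each $k \ge 2$, then

\[
F(x_k) - F(\tilde{x}) =
\begin{cases}
O\left(\dfrac{1}{k^2}\right), & \text{if } r \in (1, \infty), \\[2mm]
O\left(\dfrac{\ln(k)}{k^2}\right), & \text{if } r = 1, \\[2mm]
O\left(\dfrac{1}{k^{1+r}}\right), & \text{if } r \in [-1, 1).
\end{cases} \tag{57}
\]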


Proof. By our assumption there exists $c > 0$ such that $s_j \le c/j^r$ for all $j \in \mathbb{N}$. If $r \in (0,1)$
or $r > 1$, then

\[
\sum_{j=2}^{k}s_j \le c\sum_{j=2}^{k}j^{-r} \le c\sum_{j=1}^{k-1}\int_j^{j+1}u^{-r}\,du = \frac{c(k^{1-r} - 1)}{1 - r}.
\]

If $r = 1$, then

\[
\sum_{j=2}^{k}s_j \le c\sum_{j=1}^{k-1}\int_j^{j+1}u^{-1}\,du = c\ln(k).
\]

If $r \in [-1, 0]$, then

\[
\sum_{j=2}^{k}s_j \le c\sum_{j=2}^{k}j^{-r} \le c\sum_{j=2}^{k}\int_j^{j+1}u^{-r}\,du = \frac{c((k+1)^{1-r} - 2^{1-r})}{1 - r}.
\]

By taking into account the above expressions and (55) (including the constant terms in
the numerator of (55)) we obtain the assertion. □
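
The integral comparisons behind this proof can be checked numerically. The following is a minimal sketch, assuming the partial sums are taken over $j = 2, \ldots, k$ (an indexing assumption; the bound for $F(x_{k+1})$ in (55) uses the sum up to $k+1$, which only shifts the index). The function names `tail_sum` and `integral_bound` are hypothetical.

```python
import math

def tail_sum(r, k):
    """Sum of j^(-r) for j = 2, ..., k."""
    return sum(j ** (-r) for j in range(2, k + 1))

def integral_bound(r, k):
    """Upper bound on tail_sum(r, k) from a comparison with the integral of u^(-r)."""
    if r == 1:
        return math.log(k)
    if r > 0:
        # u^(-r) is decreasing: j^(-r) is at most the integral over [j - 1, j].
        return (k ** (1 - r) - 1) / (1 - r)
    # r <= 0: u^(-r) is nondecreasing: j^(-r) is at most the integral over [j, j + 1].
    return ((k + 1) ** (1 - r) - 2 ** (1 - r)) / (1 - r)

for r in (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.5):
    for k in (2, 10, 1000):
        assert tail_sum(r, k) <= integral_bound(r, k) + 1e-9
```

Dividing these bounds by $k^2$ gives the three regimes of (57): a bounded sum for $r > 1$, a logarithmic factor for $r = 1$, and growth like $k^{1-r}$ for $r < 1$.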


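Corollary 3.8. Under the assumptions of Theorem 3.6 without assuming (27), we have

\[
\|e_k\| \le \frac{s_k}{s_1k^2}, \quad \forall k \in \mathbb{N}. \tag{58}
\]

If, in addition, the following four conditions hold:

(I) $\lim_{k\to\infty}F(z_k) = \infty$ if and only if $(z_k)_{k=1}^{\infty}$ is an arbitrary sequence in $H$ satisfying
$\lim_{k\to\infty}\|z_k\| = \infty$,

(II) There exists $c \in (0,1]$ such that for all $k \in \mathbb{N}$, $k \ge 2$, if $e_k \neq 0$ and (17) holds,
then

\[
\|e_k\| \ge \frac{cs_k}{\sigma_k}, \tag{59}
\]

and if $e_k \neq 0$ and (18) holds, then the analogous lower bound holds with $\sigma_k$ replaced by
its counterpart associated with (18), (60)

(III) $\inf\{F(x) : x \in H\} > -\infty$,

(IV)

\[
\sup\left\{\frac{\sum_{j=2}^{k+1}s_j}{(k+1)^2} : k \in \mathbb{N}\right\} < \infty, \tag{61}
\]

then, for each $k \ge 2$, either $e_k = 0$ or

\[
\|e_k\| = \Theta\left(\frac{s_k}{k^2}\right), \tag{62}
\]

i.e., either $e_k = 0$, or, up to a multiplicative constant factor (independent of $k$) from above
and below, $\|e_k\|$ behaves as $s_k/k^2$. In particular, if Conditions (I)-(IV) hold and if there
exists $\omega \in \mathbb{R}$ such that for each $k \ge 2$ either $e_k = 0$ or

\[
\|e_k\| = \Theta\left(\frac{1}{k^{\omega}}\right), \quad \omega \ge 1, \tag{63}
\]

then

\[
F(x_k) - F(\tilde{x}) =
\begin{cases}
O\left(\dfrac{1}{k^2}\right), & \omega \in (3, \infty), \\[2mm]
O\left(\dfrac{\ln(k)}{k^2}\right), & \omega = 3, \\[2mm]
O\left(\dfrac{1}{k^{\omega-1}}\right), & \omega \in [1, 3).
\end{cases} \tag{64}
\]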


(64) 


Proof. If (17) holds, then from (17) and (54) we obtain that $\|e_k\| \le 0.5s_k/(s_1k^2)$. If
(18) holds, then from (18) and (54) we have $\|e_k\| \le s_k/(s_1k^2)$. As a result, in any case
(58) holds.

Assume now that also the other conditions (I)-(IV) hold. From (26), Condition (IV),
and Condition (III) it follows that the sequence $(F(x_k))_{k=1}^{\infty}$ is bounded. This and
Condition (I) imply that there exists $M > 0$ such that

\[
M > \max\{4s_1, 2\mu\}, \quad \text{and} \quad \|x_k\| \le 0.5M \quad \forall k \ge 1, \tag{65}
\]

where $\mu$ is any upper bound on $\|\tilde{x}\|$. This and the triangle inequality show that the closed
ball $B[x_k, 2s_1]$ is contained in $B[0, M]$. Since $F$ is bounded on bounded sets as implied by
Conditions (I) and (III), there exists $\Lambda > 0$ such that $\Lambda_k \le \Lambda$ for all $k \ge 2$, where $\Lambda_k$ is
defined in (15). In addition, the above and (5) (or (11)) and (17) (or (18)) imply that

\[
\|p_{L_k}(y_k)\| \le 0.5M + s_1. \tag{66}
\]

Since $t_k \le t_2k$ for all $k > 1$ as implied by a simple induction, it follows from Condition
(II), from (65), from (14), from (16), from (66), and from (59) that for all $k \ge 2$,
either $e_k = 0$ or

\[
\|e_k\| \ge \frac{cs_k}{2t_2^2((1/\rho)\Lambda + 1.5M + 6s_1)k^2}. \tag{67}
\]

It follows from (58) and (67) that for each $k \ge 2$ either $e_k = 0$ or $\|e_k\| = \Theta(s_k/k^2)$,
as claimed. Similar things can be said if (18) and (60) hold instead of (17) and (59)
respectively.

Finally, assume that Conditions (I)-(IV) hold and that for each $k \ge 2$ either $e_k = 0$
or (63) holds. From what was proved above this implies (62). From this and (63) we have
$s_k = \Theta(1/k^{\omega-2})$. Because $\omega \ge 1$, elementary computations (as in the proof of Corollary
3.7) show that (61) is not violated. From Corollary 3.7 we conclude that (64) holds. □


Remark 3.9. (i) When $t_2 = 1$ and $s_{k+1} = 0$ for all $k \ge 1$, then (26) implies that

\[
F(x_{k+1}) - F(\tilde{x}) \le \frac{2\tau\|y_2 - \tilde{x}\|^2}{(k+1)^2}
\]

as in [9, Relation (4.4)], up to the index value (there the index $k$ starts at 0) and up
to the fact that $y_1$ in [9] is not assumed to be arbitrary as $y_2$ here but is taken to be
$x_0$.

(ii) Frequently, the expression

\[
2\tau\left((2/L_1)t_2(t_2-1)(F(x_1) - F(\tilde{x})) + \|t_2y_2 - ((t_2-1)x_1 + \tilde{x})\|^2\right) \tag{68}
\]

which appears in the right hand side of (26) can be bounded from above even when
$\tilde{x}$ is unknown (when it is a minimizer). For example, consider the $\ell_1$-$\ell_2$ optimization
case, i.e., $F(x) = \|Ax - b\|^2 + \|x\|_1$, $x \in H := \mathbb{R}^n$, $A : \mathbb{R}^n \to \mathbb{R}^m$ is linear, $b \in \mathbb{R}^m$.
As explained in Remark 2.6, we have $\|\tilde{x}\| \le \|b\|^2$. Since $F(\tilde{x}) \ge 0$ we conclude from
the triangle inequality and the above discussion that the expression in (68) is bounded from above by

\[
2\tau\left((2/L_1)t_2(t_2-1)F(x_1) + (\|t_2y_2 - (t_2-1)x_1\| + \|b\|^2)^2\right).
\]


Remark 3.10. Interestingly, despite the difference in the various notions of inexactness 
and the algorithmic schemes considered here and elsewhere in the literature, (64), as a 
function of the decay in the error parameters, was obtained in [50, Theorem 2.1] and [92, 
Theorem 4.4] (note: in [92] the decay in the error parameters is as $\epsilon_k^2$ because of [92,
Definition 2.1]). In [82, Proposition 2] and the discussion after it the error parameters
were assumed to decay faster in order to achieve (64), e.g., an 0{l/k‘^) decay for an 
0{l/k‘^) decay in the function values. The algorithmic schemes described in these works 
include FISTA as a particular case. In [33, p. 62] a slightly better decay rate is given
in which boundary cases are allowed. For instance, an $O(1/k^3)$ decay implies an $O(1/k^2)$
decay in the function values while we require an $O(1/k^{3+\beta})$ decay for arbitrary $\beta > 0$.
However, as mentioned at the end of Subsection 1.1, the setting in [33] is somewhat
different from ours, especially when a separable function is considered. Interestingly,
even in [62], which, as explained in Subsection 1.1, also considers a different setting from
our one, one can find traces of the decay rate 0{l/k^) of the errors: see [62, Proposition
5.2(c)]. 

The above discussion leads us to conjecture that there are some non-obvious relations
between the various notions of inexactness. In fact, [80, Proposition 2.5] and the
discussion after [80, Definition 2.1] show that our notion of inexactness may be weaker than
the one discussed in [92]. On the other hand, because in Corollary 3.8 we impose
Conditions (I)-(IV), we assume something which is not assumed in [92] and in other works
mentioned above (Corollary 3.8 is especially good for the case of superiorization because
in this case the user actively controls the errors). We also suspect that there are examples
for functions $F$ such that $\|e_k\| = \Theta(1/k^{\omega})$ for fixed $\omega \in (0,1)$ but $\lim_{k\to\infty}F(x_k)$ does not
exist or it exists but is not equal to $F(\tilde{x})$ (assuming $\tilde{x}$ is a minimizer of $F$).


4. Superiorization 

4.1. Background. In Section 1 we mentioned briefly the superiorization methodology as 
one of the reasons for considering inexact versions of FISTA. Motivated by this reason, we 
re-examine in this section the superiorization methodology in a thorough way and show 
that its scope can be significantly extended.

First, let us recall again the principles behind the superiorization methodology. Suppose
that our goal is to solve some constrained optimization problem. The full problem might
be too demanding from the computational point of view, but solving only the constrained
part (the feasibility problem) can be achieved by an algorithm $A$ which is rather simple
and computationally cheap. Suppose further that $A$ is known to be perturbation resilient,
that is, a perturbed version $A'$ of $A$ due to error terms also produces solutions to the
constrained part. The superiorization methodology claims that often we can do something
useful with the perturbed version. The “something useful” can be a solution $x'$ (or an
approximate solution) to the feasibility problem which is superior, with respect to some
given cost function $\phi$, to a solution $x$ which would be obtained by considering the original
algorithm $A$. In other words, $\phi(x') \le \phi(x)$, and frequently $\phi(x')$ is much smaller than
$\phi(x)$ or at least the computation time needed to find $x'$ will be smaller than the one
needed to find $x$. A possible way to approximate $x'$ is by performing in each iteration a
feasibility-seeking step and immediately after it a superiorization step aiming at reducing
$\phi$ at the current iteration by playing carefully with the error parameters.

This heuristic methodology was officially introduced in 2009 in [30], but historically,
the first works in this research branch are the 2007 paper [13] and the 2008 paper [45]
which did not use the explicit term “superiorization”. Since then, the methodology has
been investigated in various works, e.g., in [7,20-23,28,29,46,51,52,56,71,72]. See
also [19,44] for two recent surveys and [18] for a continuously updated online list of works
related to the superiorization methodology. Although the point $x'$ is not a solution to
the original constrained optimization problem, promising experimental results discussed
in many of the above mentioned works show the potential of superiorization in real-world
scenarios (for instance, for the analysis of images coming from medical sciences and
machine engineering).

However, from the theoretical point of view the methodology is still in its initial stages.
In particular, the few mathematical results that exist do not give a full theoretical
justification of its success. As a matter of fact, even the potential scope of the methodology has
not been fully investigated. So, on the one hand, some of these works (e.g., [19,28,44]
and [20]) show that the pioneers of this methodology have definitely been aware of the
generality of the approach, but, on the other hand, a more careful reading of these works
(e.g., Definition 4, Algorithm 5, and Definition 9 in [19]) shows that the actual setting
which has been considered is not completely general.

To be more concrete, the setting is a real Hilbert space $H$ (usually finite dimensional);
the perturbed iterations should have the form $x_{k+1} = T_k(x_k + \beta_kv_k)$ for some operator
$T_k : H \to H$, where $(v_k)_{k=1}^{\infty}$ is a bounded sequence in $H$ and $(\beta_k)_{k=1}^{\infty}$ is a sequence of
nonnegative real numbers satisfying $\sum_{k=1}^{\infty}\beta_k < \infty$; if a convergence notion is discussed,
then this notion is standard: mainly strong convergence (rarely, as in [22], also
convergence in the weak topology); in several places, e.g., [20], there are limitations on the
considered functions (e.g., $\phi$ must be convex); the algorithmic operator used at iteration
$k + 1$ depends only on iteration $k$ (and possibly on some parameters depending on $k$) but
not on previous iterations such as both iterations $k$ and $k - 1$, as, e.g., in the perturbed
version of FISTA (Section 2).
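
The classical setting just described can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not code from any of the cited works: the basic algorithm is one sweep of sequential projections onto two hypothetical half-spaces in the plane, the cost function is $\phi(x) = \|x\|^2$, the directions $v_k$ are the normalized negative gradients of $\phi$, and $\beta_k = 0.5^k$ is summable. All names (`CONSTRAINTS`, `feasibility_step`, `run`) are hypothetical.

```python
import math

def project_halfspace(x, a, b):
    """Project x onto the half-space {z : <a, z> >= b}."""
    gap = b - sum(ai * xi for ai, xi in zip(a, x))
    if gap <= 0:
        return list(x)
    scale = gap / sum(ai * ai for ai in a)
    return [xi + scale * ai for ai, xi in zip(a, x)]

# Hypothetical feasibility problem: x1 + x2 >= 2 and x1 >= 0.
CONSTRAINTS = [([1.0, 1.0], 2.0), ([1.0, 0.0], 0.0)]

def feasibility_step(x):
    """The operator T of the basic algorithm: one sweep of sequential projections."""
    for a, b in CONSTRAINTS:
        x = project_halfspace(x, a, b)
    return x

def is_feasible(x, tol=1e-9):
    return all(sum(ai * xi for ai, xi in zip(a, x)) >= b - tol for a, b in CONSTRAINTS)

def phi(x):
    """Cost function used for judging superiority (squared Euclidean norm)."""
    return sum(xi * xi for xi in x)

def run(x0, iterations, superiorize):
    x = list(x0)
    for k in range(iterations):
        if superiorize:
            norm = math.sqrt(phi(x))
            if norm > 0:
                beta = 0.5 ** k     # summable step sizes
                # v_k = -x / ||x|| is a bounded direction that does not increase phi.
                x = [xi - beta * xi / norm for xi in x]
        x = feasibility_step(x)     # x_{k+1} = T(x_k + beta_k v_k)
    return x

plain = run([-1.0, 5.0], 50, superiorize=False)
superiorized = run([-1.0, 5.0], 50, superiorize=True)
```

In this toy run both outputs are feasible, while the perturbed run ends at a point with a smaller value of $\phi$; it is only meant to show the mechanics of interleaving perturbation steps with feasibility-seeking steps.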

Moreover, in all the works related to superiorization that we have seen, the perturba¬ 
tion resilience property of the algorithm A mentioned above has been understood in the 
feasibility sense and not in other contexts (e.g., in a context of finding a superior solution
to an unconstrained minimization problem using a perturbation resilient algorithm). In
other words, the perturbation resilience property is understood in the sense that both
the sequence produced by $A$ and its perturbed version produced by $A'$ should converge
to a feasible solution. A frequently used version of this criterion is to use a proximity 
function which measures the distance to the feasible set, and a solution is a point in 
the space in which this proximity function attains a value not greater than some given 
error parameter [19,28,29,38]. The above is consistent with the fact that often the 
superiorization methodology is described as lying between optimization and the (convex) 
feasibility problem: see, e.g., [19,20,23,71] and [28, p. 90]. 

4.2. Our contribution. What is suggested here is to extend the superiorization prin¬ 
ciple by allowing any type of perturbations, any notion of inexactness, any notion of 
convergence, and any type of optimization-related problem. More precisely, given any 
optimization-related problem in some given space, suppose that we have in our hands a 
notion of an algorithm A which produces a sequence of elements in the space (they can 
be thought of as being intermediate solutions to the problem) and a notion of a solution 
of the problem (e.g., the limit of the sequence or some intermediate solution satisfying a 
certain termination criterion). Moreover, suppose that we have in our hands a notion of 
inexactness (or a notion of perturbation) of the algorithm, so that instead of considering 
the sequence produced by A we consider a sequence produced by a perturbed algorithm 
$A'$. If there is a mathematical result saying that any perturbed sequence (according to
our notion of inexactness) also induces a solution to the original problem, then we can
consider the set of all perturbed sequences, with the hope that we will be able to find in
this set, by one way or another, a sequence which will lead us to a superior solution. 



Roughly speaking, a “superior solution” means a solution to the original problem which
is better, according to some criterion (preferably a criterion which is quantitative and
simple to apply), than “standard solutions”, namely, solutions which are found using the
algorithm $A$. This additional criterion can be thought of as being “a notion of superiority”.
For example, the notion of superiority can be based on a given cost function $\phi$. In this
case, if $(x_k)_k$ is the sequence produced by the original algorithm $A$ with an induced
solution $x$, and if $(x'_k)_k$ is the perturbed sequence having $x'$ as the induced solution, then
$x'$ is considered as being a superior solution to $x$ if $\phi(x') \le \phi(x)$. Alternatively, we can
say that $x'$ is superior to $x$ whenever $\phi(x'_k) \le \phi(x_k)$ for all $k$ large enough. In both cases
strict inequalities are preferred. When the original problem is to minimize a function $F$
under some constraints, then a possible choice for $\phi$ is to take $\phi := F$. A third superiority
criterion is to consider several cost functions $\phi_i$, $i \in I$ for some nonempty set of indices
$I$, i.e., $\phi_i(x') \le \phi_i(x)$ for all $i \in I$, or at least that $\phi_i(x'_k) \le \phi_i(x_k)$ for all $i \in I$ and all $k$
large enough. A simple illustration for this third criterion is to take $I = \{1, 2\}$, $X = \mathbb{R}^n$,
$\phi_1 : X \to [0, \infty)$ as the total variation and $\phi_2 : X \to [0, \infty)$ as the penalty function $p$
suggested in [57] (see also [38, p. 166]).

In practice the perturbed sequence $(x'_k)_k$ will be determined by some error terms (which
can be vectors, positive parameters, etc.). No matter how we play with these error terms,
as long as they satisfy the conditions of the perturbation resilient result that we have
in our hands, we obtain a sequence which is guaranteed to converge in some sense to a
solution of the problem. However, by a clever modification of the error terms in each
iteration we may steer the sequence to a superior solution.

The examples below show the wide spectrum of this general principle (virtually, any 
optimization-related problem can be considered), thus significantly extending the scope 
of the original superiorization methodology. In order to simplify the notation below, we
refer to the error terms as $e_k$ when they are vectors and $\epsilon_k$ when they are positive numbers
(although in the original works a different notation was sometimes used).

Example 4.1. Optimization problem: (accelerated) minimization of a convex function
in finite and infinite dimensional Hilbert spaces. Notion of convergence: non-asymptotic
(function values). A few notions of inexactness: see the details regarding Devolder et
al. [33], Jiang et al. [50], Monteiro-Svaiter [62], Schmidt et al. [82], and Villa et al. [92]
in Section 1 above; see also (5) and Theorem 3.6 above.

Example 4.2. Optimization problem: finding zeros of (nonlinear, maximal monotone)
operators. Notion of convergence: weak or strong topology. A few notions of inexactness
and settings:

• Rockafellar [79]: $\|x_{k+1} - P_k(x_k)\| \le \epsilon_k$ or $\|x_{k+1} - P_k(x_k)\| \le \epsilon_k\|x_{k+1} - x_k\|$, where
$\sum_{k=1}^{\infty}\epsilon_k < \infty$ and $P_k = (I + c_kT)^{-1}$ is a proximal operator induced by the operator
$T$ whose zeros are sought and $c_k > 0$. Setting: a real Hilbert space.

• Eckstein [34]: $\nabla h(x_k) + e_k \in \nabla h(x_{k+1}) + c_kT(x_{k+1})$ for a given Bregman function $h$,
where both $\sum_{k=1}^{\infty}\|e_k\| < \infty$ and $\sum_{k=1}^{\infty}\langle e_k, x_k\rangle$ should exist and be finite. Setting:
the Euclidean $\mathbb{R}^n$.

• Solodov-Svaiter [86]: here the goal is to find a zero of the operator $T$ in a real
Hilbert space under a linear constraint. The perturbation appears in several forms:
first, in an $\epsilon_k$-enlargement of $T$; second, in a certain inequality involving $\epsilon_k$, $x_k$,
and other components of the algorithm (including a relative error tolerance $\sigma_k$);
third, in a “halfspace-type projection” involving $\epsilon_k$.

• Reich-Sabach [75]: here the goal is to find a common zero of finitely many
operators $A_i$, $i \in \{1, \ldots, N\}$, in a real reflexive Banach space. There are two types
of perturbations. The first type appears in [75, (4.1)] in four places. The first
place is in the equation $e_k^i = \xi_k^i + (1/\lambda_k^i)(\nabla f(y_k^i) - \nabla f(x_k))$ where $\xi_k^i \in A_i(y_k^i)$,
the second is in the term $w_k^i = \nabla f^*(\lambda_k^ie_k^i + \nabla f(x_k))$, the third is in the set
$C_k^i = \{z \in X : D_f(z, y_k^i) \le D_f(z, w_k^i)\}$ via $w_k^i$, and the fourth is in the set
$C_k := \cap_{i=1}^{N}C_k^i$. Here $f$ is a Bregman function, $D_f$ is the induced Bregman
divergence (Bregman distance), $\lambda_k^i$ is a positive parameter, $f^*$ is the convex conjugate
(Fenchel conjugate) of $f$, and $y_k^i$ is an additional term satisfying certain relations.

The second type appears in [75, (4.4)] in three places. The first place is in the
term $y_k^i = \mathrm{Res}^f_{\lambda_k^iA_i}(x_k + e_k^i)$ where $f$ is a Bregman function, $\lambda_k^i$ is a certain positive
parameter, $x_k$ is determined in other steps of the algorithm, and $\mathrm{Res}^f_{\lambda_k^iA_i}$ is the
resolvent of the operator $\lambda_k^iA_i$ relative to $f$. The second place is in the definition
of a certain subset $C_k^i$ defined in an intermediate step of the algorithm and the
perturbation appears as $x_k + e_k^i$ inside the definition of $C_k^i$. The third place is in
the set $C_k := \cap_{i=1}^{N}C_k^i$. The error terms $e_k^i$ can be arbitrary (this issue has been
clarified recently and will be discussed elsewhere).

Example 4.3. Optimization problem: finding fixed points of nonlinear operators in real
reflexive Banach spaces. Notion of convergence: weak or strong topology. Some examples:

• Reich-Sabach [76]: here the goal is to find a common fixed point of finitely many
operators $T_i$, $i \in \{1, \ldots, N\}$. The perturbation comes in two forms: first, as $y_k^i =
T_i(x_k + e_k^i)$ where $x_k$ is determined in other intermediate steps of the algorithm.
Second, the perturbation also appears (as $x_k + e_k^i$) in the definition of a certain
subset $C_k^i$ defined in an intermediate step of the algorithm. The error terms $e_k^i$ can
be arbitrary (this issue has been clarified recently and will be discussed elsewhere).

• Butnariu-Reich-Zaslavski [14]: here several notions of inexactness are used. These
conditions are equivalent to saying that four sequences $(\epsilon_{i,k})_{k=1}^{\infty}$, $i \in \{1, 2, 3, 4\}$ of
nonnegative numbers are given and we assume that their sum is finite; now, for
each $k \in \mathbb{N}$ the iteration $x_{k+1}$ is an arbitrary vector which satisfies the following
inequalities: $D_f(T(x_k), x_{k+1}) \le \epsilon_{1,k}$, $\|f'(T(x_k)) - f'(x_{k+1})\| \le \epsilon_{2,k}$, $\|f'(T(x_k)) -
f'(x_{k+1})\|\,\|T(x_k)\| \le \epsilon_{3,k}$, and $\langle f'(x_{k+1}) - f'(T(x_k)), x_{k+1} - T(x_k)\rangle \le \epsilon_{4,k}$. Here
$T$ is the operator whose fixed points are sought and $D_f$ is a Bregman divergence
(distance) with respect to a given Bregman function $f$.

Example 4.4. Optimization problem: minimization of a real lower semicontinuous proper 
convex function /. We mention here two examples: 

• Cominetti [27]: The notion of convergence is weak or strong. Notion of inexact¬ 
ness: Xk-{l/Xk)xk-i E dej{xk,rk), where is an efc-subdiherential (of f{-,rk)), 
/(•,•) is (by abuse of notation) a proper convex lower semicontinuous approxima¬ 
tion of / depending on Xk, Xk > 0 , and > 0 and has the property that its 
minimal value is finite and tends to the minimal value of $f$ (whose set of minimizers
is assumed to be nonempty) as $r > 0$ tends to 0. There are a few conditions
on some parameters, e.g., in [27, Theorem 3.1] one requires that hmfc_j.ooLfc = 0,
= OO, < OO or hmfc^ooefe//9(rfc) = 0, and there are ad¬ 

ditional conditions; here /S(rfe) > 0 is a strong convexity parameter of f{-,rk). 
Setting: a real Hilbert space. 

• Zaslavski [95]: two notions of convergence are used: in the first [95, Theorem
1.2] the notion is that the distance of $x_k$ from the solution set is smaller than a
given error parameter $\epsilon > 0$. The second notion of convergence is convergence
in the function values. The notion of inexactness in both cases has the form
$x_k + e_k = \mathrm{argmin}_{x \in \mathbb{R}^n}(f(x) + (1/\lambda_{k-1})B(x, x_{k-1}))$ for some Bregman divergence
$B$ and a relaxation parameter $\lambda_{k-1} > 0$, $k \in \mathbb{N}$. In addition, it is assumed that
there exists $\delta > 0$ depending on $\epsilon$ such that $\|e_k\| \le \delta$ for each $k \in \mathbb{N}$. Setting: the
Euclidean $\mathbb{R}^n$.


Example 4.5. Optimization problem: a generalized mixed variational inequality prob¬ 
lem in a real Hilbert space (Xia-Huang [93]). Notion of convergence: weak. Notion of 
inexactness: based on error terms whose magnitude should be small enough so that it 
satisfies a certain implicit inequality [93, Relation (3.3)] which is also determined by some
parameters given by the user including a relative error parameter a. 


Example 4.6. Optimization problem: finding attracting points of an infinite product of
countably many nonexpansive operators $T_i$, $i \in \mathbb{N}$ (Pustylnik-Reich-Zaslavski [73]) in a
complete metric space. Notion of convergence: the distance between the iterations and
the attracting set $F$ tends to 0. Notion of inexactness: for each $\epsilon > 0$ there exists $\delta > 0$
and a natural number $n_0$ such that for each “good” control $r : \{0, 1, 2, \ldots\} \to \{0, 1, 2, \ldots\}$
and each sequence $(x_k)_{k=0}^{\infty}$ satisfying $d(x_{k+1}, T_{r(k)}x_k) \le \delta$ for each $k \in \{0, 1, 2, \ldots\}$, the
inequality $d(x_k, F) \le \epsilon$ holds for every $k \ge n_0$.


Example 4.7. Optimization problem: solving the (convex) feasibility problem. Many 
examples are given in papers dealing with superiorization. Here we mention examples 
which seem to be less familiar in the superiorization literature. The notion of inexactness 
in them is weak or strong. 


• De Pierro-Iusem [31]: the perturbation appears as
$x_{k+1} = x_k - \alpha_k\frac{g_{i(k)}(x_k) + \epsilon_k}{\|t_k\|^2}t_k$
when $g_{i(k)}(x_k) > 0$; here $t_k \neq 0$ is a subgradient of the convex function $g_{i(k)}$ at
the point $x_k$ and $\alpha_k$ is a relaxation parameter. It is assumed [31, Section 3.1]
that $(\epsilon_k)_{k=1}^{\infty}$ is a monotonically decreasing sequence of positive parameters which
converges to zero and satisfies the condition $\sum_{k=1}^{\infty}\epsilon_k = \infty$. Setting: the Euclidean
$\mathbb{R}^n$.

• Censor-Reem [22]: the perturbation has the form
$P_{\Omega}\left(x_k - \lambda_k\frac{g_{i(k)}(x_k)}{\|t_k\|^2}t_k + e_k\right)$
whenever $g_{i(k)}(x_k) > 0$; here $t_k \neq 0$ is a zero-subgradient of the zero-convex function
$g_{i(k)}$ at the point $x_k$ and $\lambda_k > 0$ is a relaxation parameter, and $P_{\Omega}$ is the best
approximation projection on the nonempty closed and convex subset $\Omega$ on which
the functions $g_j$, $j \in \mathbb{N}$ are defined. There are additional assumptions, among
them [22, Condition 1] saying that for each $k \in \mathbb{N}$ the norm of the error term $e_k$
is bounded above by $\min\{\mu, \epsilon_1\epsilon_2h_k^2/(2(5\mu + 4h_k))\}$, where $\mu$, $\epsilon_1$, and $\epsilon_2$ are certain
given positive parameters and $h_k$ is a certain nonnegative parameter depending on
other parameters (e.g., on $g_{i(k)}(x_k)$). For a slightly different type of perturbation,
see [22, Subsection 8.1]. Setting: a real Hilbert space.

Example 4.8. Optimization problem: any problem which makes use of relaxation pa¬ 
rameters (as in many of the above examples). These parameters can also be thought of as 
“resilience error parameters” since it is guaranteed that the various algorithms converge 
whenever the parameters satisfy a mild condition (e.g., being in the interval (e, 2 — e) 
for some arbitrary small e G (0,1)). It is well-known that the relaxation parameters can 
significantly influence the speed of convergence of the algorithm (for a simple illustration
of this phenomenon, see [22, Section 7]). 
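
The influence of relaxation parameters can be seen on a very small example. The following is a minimal sketch under stated assumptions (it is not the illustration of [22, Section 7]): a relaxed sequential projection method is applied to two hypothetical lines through the origin in the plane, and the number of sweeps needed to reach a given accuracy is recorded for several relaxation parameters in $(0, 2)$. The names `ROWS`, `relaxed_projection`, and `sweeps_needed` are hypothetical.

```python
import math

# Two hypothetical lines {z : <a, z> = b} with unit normals, meeting only at the origin.
ROWS = [([1.0, 0.0], 0.0), ([1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)], 0.0)]

def relaxed_projection(x, a, b, lam):
    """Return x + lam * (P(x) - x), where P projects onto {z : <a, z> = b} and ||a|| = 1."""
    residual = b - sum(ai * xi for ai, xi in zip(a, x))
    return [xi + lam * residual * ai for ai, xi in zip(a, x)]

def sweeps_needed(lam, tol=1e-8, max_sweeps=10000):
    """Number of sweeps until the distance to the solution (the origin) drops below tol."""
    x = [1.0, 2.0]
    for sweep in range(1, max_sweeps + 1):
        for a, b in ROWS:
            x = relaxed_projection(x, a, b, lam)
        if math.hypot(x[0], x[1]) < tol:
            return sweep
    return None

counts = {lam: sweeps_needed(lam) for lam in (0.5, 1.0, 1.5)}
```

All three choices converge, but the under-relaxed choice 0.5 needs clearly more sweeps than the unrelaxed choice 1.0 on this problem, which is the kind of dependence on the relaxation parameter that the example above refers to.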

Many additional examples can be found in the following rather partial list of references 
and in some of the references therein: [1,2,11,12,25,26,32,36,41,43,47-49,53,54,58,61, 
70,74,77,78,80,81,83-85,88,96-100]. Most of the above-mentioned references do not
mention the word “superiorization” explicitly. In fact, many of the authors involved were
not even aware of this branch of optimization when preparing their papers
(e.g., because many papers were published years before the superiorization methodology
was introduced). However, as noted above, one can find in these papers results ensuring the
perturbation resilience of certain algorithms. One can also think about other settings in
which the superiorization methodology can be used, e.g., when the notion of convergence 
is based on Banach limits, asymptotic centers, convergence in the sense of Mosco, etc., 
and when the optimization problems are combinatorial or mixed combinatorial (integer 
programming) and continuous. 

4.3. Concluding remarks. We want to conclude this section with the following words. 
The previous paragraphs not only extend the horizon of the superiorization methodology, 
but also pose various challenges. First, to develop a formalism which will handle the above 
mentioned examples (or at least an important class of them) in a rigorous way. Second, to 
provide various real world examples showing the usefulness of the general superiorization 
methodology. Third, to formulate theoretical and practical sufficient (and/or necessary) 
conditions which will ensure the convergence (in the considered notion of convergence) of 
the perturbed sequence to a superior solution. Fourth, to obtain results regarding rates 
of convergence (e.g., that given some approximation parameter $\epsilon > 0$, there exists $k_0 \in \mathbb{N}$
such that for all $k_0 \le k \in \mathbb{N}$, iteration number $k$ of the perturbed algorithm is an $\epsilon$-solution
of the original problem). Fifth, to obtain theoretical and practical results for multiple 
cost functions (this creates an interesting and new connection between superiorization 
and feasibility, where this time feasibility is not the target of the perturbed algorithm,
but rather an assumption about the existence of a joint superior solution for several cost
functions). Sixth, to present systematic methods for finding good perturbations, e.g., ones
which will ensure that with high probability the perturbed iteration is superior to the 
unperturbed one. It is our hope that at least some of these challenges will be addressed 
and that the discussion of this section will be found to be helpful in optimization theory 
and beyond. 


5. Appendix 

In this appendix we present the proofs of a few auxiliary claims mentioned in the main 
body of the text. Lemma 5.1 below was mentioned in Remark 2.1. 
Lemma 5.1. Given a real Hilbert space $H$ with an inner product $\langle\cdot,\cdot\rangle$ and an induced
norm $\|\cdot\|$, suppose that $f : H \to \mathbb{R}$ is continuously differentiable with a Lipschitz constant
$L(f')$ of $f'$. Then for all $L \ge L(f')$ and all $x, y \in H$,

\[ f(x) \le f(y) + \langle f'(y), x - y\rangle + 0.5L\|x - y\|^2. \tag{69} \]

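Proof. Fix $x, y \in H$ and let $\phi : [0,1] \to \mathbb{R}$ be defined by $\phi(t) = f(y + t(x - y))$. From the
chain rule $\phi'$ is continuous and $\phi'(t) = \langle f'(y + t(x - y)), x - y\rangle$ for each $t \in [0,1]$. As a result,
the fundamental theorem of calculus, the assumption that $f'$ is Lipschitz continuous, the
triangle inequality for integrals, and the Cauchy-Schwarz inequality imply that

\begin{align*}
f(x) = \phi(1) &= \phi(0) + \int_0^1 \phi'(t)\,dt = f(y) + \int_0^1 \langle f'(y + t(x - y)), x - y\rangle\,dt \\
&= f(y) + \int_0^1 \langle f'(y), x - y\rangle\,dt + \int_0^1 \langle f'(y + t(x - y)) - f'(y), x - y\rangle\,dt \\
&\le f(y) + \langle f'(y), x - y\rangle + \int_0^1 |\langle f'(y + t(x - y)) - f'(y), x - y\rangle|\,dt \\
&\le f(y) + \langle f'(y), x - y\rangle + \int_0^1 \|f'(y + t(x - y)) - f'(y)\|\,\|x - y\|\,dt \\
&\le f(y) + \langle f'(y), x - y\rangle + \int_0^1 L\|y + t(x - y) - y\|\,\|x - y\|\,dt \\
&= f(y) + \langle f'(y), x - y\rangle + L\|x - y\|^2 \int_0^1 t\,dt \\
&= f(y) + \langle f'(y), x - y\rangle + 0.5L\|x - y\|^2.
\end{align*}

□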
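
Inequality (69) can be checked numerically on a concrete function. The sketch below is only an illustration under stated assumptions: it takes $H = \mathbb{R}^3$ and the hypothetical choice $f(x) = \sum_i \log(1 + e^{x_i})$, whose derivative is the coordinate-wise logistic function and is Lipschitz continuous with constant $1/4$, and it samples random pairs $x, y$. The names `f`, `grad_f`, and `descent_gap` are illustrative.

```python
import math
import random


def f(x):
    """Sum of softplus terms log(1 + exp(x_i)); smooth and convex."""
    return sum(math.log1p(math.exp(xi)) for xi in x)


def grad_f(x):
    """Gradient of f: the logistic sigmoid applied coordinate-wise."""
    return [1.0 / (1.0 + math.exp(-xi)) for xi in x]


# Lipschitz constant of grad_f (the slope of the sigmoid is at most 1/4).
L_F_PRIME = 0.25


def descent_gap(x, y, L):
    """Right-hand side of (69) minus its left-hand side (nonnegative if (69) holds)."""
    g = grad_f(y)
    diff = [xi - yi for xi, yi in zip(x, y)]
    inner = sum(gi * di for gi, di in zip(g, diff))
    sq_norm = sum(di * di for di in diff)
    return f(y) + inner + 0.5 * L * sq_norm - f(x)


random.seed(0)
gaps = []
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(3)]
    y = [random.uniform(-5, 5) for _ in range(3)]
    gaps.append(descent_gap(x, y, L_F_PRIME))

print("smallest gap over the samples:", min(gaps))
```

A nonnegative smallest gap (up to rounding) is consistent with Lemma 5.1 applied with $L = L(f')$; of course, such sampling is a sanity check and not a proof.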


Lemma 5.2 below is needed for proving Lemma 3.1. 

Lemma 5.2. Let $H$ be a real Hilbert space with an inner product $\langle\cdot,\cdot\rangle$ and an induced
norm $\|\cdot\|$. For all $y \in H$ and $L > 0$, let $u : H \to (-\infty, \infty]$ be defined by $u(x) := Q_L(x, y)$,
where $Q_L$ is defined in (7) with $L$ instead of $L_k$. Then $u$ has a unique minimizer $p_L(y)$
and there exists $\gamma \in \partial g(p_L(y))$ such that

\[ f'(y) + \gamma = L(y - p_L(y)). \tag{70} \]

Proof. Since $g$ is proper, convex, and lower semicontinuous, it follows from the definition
of $u$ and $Q_L$ that $u$ is the sum of the smooth convex and quadratic function $q(x) :=
f(y) + \langle f'(y), x - y\rangle + 0.5L\|x - y\|^2$ and the proper convex lower semicontinuous function
$g$. Hence by [6, Corollary 11.15] there exists a unique global minimizer $p_L(y)$ of $u$. By
Fermat’s rule [6, Theorem 16.2, p. 233] a point $z$ is a (global) minimizer of some proper
function $G$ if and only if $0 \in \partial G(z)$. Let $G := u$ and $z := p_L(y)$. Since $q$ is differentiable,
from [6, Proposition 17.26, p. 251] one has $\partial q(x) = \{q'(x)\}$ for each $x \in H$. Since
$0 \in \partial G(z)$, the sum rule [91, Theorem 5.38, p. 77] and its proof imply that $\partial g(z) \neq \emptyset$
and $\partial G(z) = \partial q(z) + \partial g(z)$. The assertion follows from the above lines because $q'(z) =
f'(y) + L(z - y)$. □
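
To make (70) concrete, consider the following sketch. It assumes the finite-dimensional setting $H = \mathbb{R}^4$, the hypothetical smooth term $f(x) = 0.5\|x - c\|^2$, and $g = \tau\|\cdot\|_1$, for which the minimizer of $Q_L(\cdot, y)$ is known in closed form (soft-thresholding of $y - f'(y)/L$). The vector $\gamma$ is then read off from (70) and tested for membership in $\partial g(p_L(y))$. All names and data are illustrative.

```python
def soft_threshold(v, t):
    """Coordinate-wise soft-thresholding: the proximal map of t * ||.||_1."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def p_L(y, grad_f_y, L, tau):
    """Minimizer of Q_L(., y) when g = tau * ||.||_1 (closed form)."""
    shifted = [yi - gi / L for yi, gi in zip(y, grad_f_y)]
    return soft_threshold(shifted, tau / L)


def in_l1_subdifferential(gamma, p, tau, tol=1e-12):
    """Check that gamma belongs to the subdifferential of tau * ||.||_1 at p."""
    for gi, pi in zip(gamma, p):
        if pi > 0 and abs(gi - tau) > tol:
            return False
        if pi < 0 and abs(gi + tau) > tol:
            return False
        if pi == 0 and abs(gi) > tau + tol:
            return False
    return True


# Hypothetical data: f(x) = 0.5 * ||x - c||^2, hence f'(y) = y - c.
c = [3.0, -0.2, 0.05, -4.0]
y = [1.0, 1.0, -1.0, 0.5]
L, tau = 2.0, 1.0

grad_f_y = [yi - ci for yi, ci in zip(y, c)]
p = p_L(y, grad_f_y, L, tau)
# Solve (70) for gamma: gamma = L * (y - p_L(y)) - f'(y).
gamma = [L * (yi - pi) - gi for yi, pi, gi in zip(y, p, grad_f_y)]

print("p_L(y) =", p)
print("gamma in subdifferential:", in_l1_subdifferential(gamma, p, tau))
```

Coordinates of $p_L(y)$ that vanish correspond to entries of $\gamma$ lying in $[-\tau, \tau]$, while the nonzero coordinates force $\gamma_i = \tau\,\mathrm{sign}(p_L(y)_i)$, in agreement with the subdifferential of the $\ell_1$ norm.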
Proof of Lemma 3.1. Since we can use Lemma 5.2 in our infinite dimensional setting,
the proof is very similar to the proof of [9, Lemma 2.3]. Indeed, fix $x \in H$. From the
inequality $F(p_L(y)) \le Q_L(p_L(y), y)$ we have

\[ F(x) - F(p_L(y)) \ge F(x) - Q_L(p_L(y), y). \tag{71} \]

Since $f'$ exists, $\partial f(x) = \{f'(x)\}$ for each $x \in H$, as follows from [6, Proposition 17.26, p.
251]. From Lemma 5.2 we know that there exists $\gamma \in \partial g(p_L(y))$ such that (70) holds. The
above and the subgradient inequality imply the following inequalities:

\begin{align*}
f(x) &\ge f(y) + \langle f'(y), x - y\rangle, \\
g(x) &\ge g(p_L(y)) + \langle \gamma, x - p_L(y)\rangle.
\end{align*}

After summing these inequalities and recalling that $F = f + g$ we arrive at

\[ F(x) \ge f(y) + \langle f'(y), x - y\rangle + g(p_L(y)) + \langle \gamma, x - p_L(y)\rangle. \tag{72} \]

From (7) one has

\[ Q_L(p_L(y), y) = f(y) + \langle f'(y), p_L(y) - y\rangle + 0.5L\|p_L(y) - y\|^2 + g(p_L(y)). \tag{73} \]

As a result of (70) and (71)-(73) we have

\begin{align*}
F(x) - F(p_L(y)) &\ge \langle x - p_L(y), f'(y) + \gamma\rangle - 0.5L\|p_L(y) - y\|^2 \\
&= \langle x - p_L(y), L(y - p_L(y))\rangle - 0.5L\|p_L(y) - y\|^2 \\
&= \langle y - p_L(y), L(y - p_L(y))\rangle + \langle x - y, L(y - p_L(y))\rangle - 0.5L\|p_L(y) - y\|^2 \\
&= L\|y - p_L(y)\|^2 + L\langle x - y, y - p_L(y)\rangle - 0.5L\|p_L(y) - y\|^2 \\
&= 0.5L\|p_L(y) - y\|^2 + L\langle y - x, p_L(y) - y\rangle
\end{align*}

as claimed. □
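
The final inequality of this proof, $F(x) - F(p_L(y)) \ge 0.5L\|p_L(y) - y\|^2 + L\langle y - x, p_L(y) - y\rangle$, can also be tested numerically. The sketch below again assumes $H = \mathbb{R}^4$, the hypothetical choice $f(x) = 0.5\|x - c\|^2$ (so that $L(f') = 1$), $g = \tau\|\cdot\|_1$, and $L = 2 \ge L(f')$, which by Lemma 5.1 guarantees $F(p_L(y)) \le Q_L(p_L(y), y)$. All names and data are illustrative.

```python
import random

TAU, L = 1.0, 2.0             # hypothetical l1 weight and parameter L >= L(f') = 1
C = [3.0, -0.2, 0.05, -4.0]   # hypothetical data defining f


def f(x):
    """Smooth term f(x) = 0.5 * ||x - C||^2."""
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, C))


def g(x):
    """Nonsmooth term g(x) = TAU * ||x||_1."""
    return TAU * sum(abs(xi) for xi in x)


def p_L(y):
    """Minimizer of Q_L(., y): soft-thresholding of the gradient step y - f'(y)/L."""
    out = []
    for yi, ci in zip(y, C):
        v = yi - (yi - ci) / L
        out.append(max(abs(v) - TAU / L, 0.0) * (1.0 if v > 0 else -1.0))
    return out


def lemma_gap(x, y):
    """Left-hand side minus right-hand side of the inequality proved above."""
    p = p_L(y)
    sq = sum((pi - yi) ** 2 for pi, yi in zip(p, y))
    inner = sum((yi - xi) * (pi - yi) for xi, yi, pi in zip(x, y, p))
    return (f(x) + g(x)) - (f(p) + g(p)) - (0.5 * L * sq + L * inner)


random.seed(1)
gaps = [lemma_gap([random.uniform(-5, 5) for _ in C],
                  [random.uniform(-5, 5) for _ in C]) for _ in range(1000)]
print("smallest gap over the samples:", min(gaps))
```

As with the check of (69), a nonnegative smallest gap (up to rounding) merely agrees with the lemma on the sampled points.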